WLCG Operations Coordination Minutes - May 8th, 2014

Agenda

Attendance

  • local: Nicolò Magini (secretary), Andrea Sciabà, Maarten Litmaath (ALICE), Marian Babik, Maria Dimou, Simone Campana (ATLAS), Andrej Filipcic (ATLAS), Felix Lee (ASGC), Oliver Keeble, Julia Andreeva, Michail Salichos, Zbigniew Baranowski, Stefan Roiser (LHCb).
  • remote: Josep Flix (chair, PIC), Maite Barroso (Tier-0), Valery Mitsyn (JINR), Thomas Hartmann (KIT), Shawn McKee, Cristina Aiftimiei (EGI), Christoph Wissing (CMS), Peter Solagna (EGI), Di Qing (TRIUMF), Antonio Maria Perez Calero Yzquierdo, Gareth Smith (RAL), Rob Quick (OSG), Dave Dykstra, Alessandra Forti.

News

  • Alastair Dewhurst replaces Simone Campana in the IPv6 task force. Thanks to Simone and a warm welcome to Alastair!
  • Discussion on ARGUS support future
  • Discussion on the mandate and the objectives of a new task force or working group on network and transfer metrics
    • End of the xrootd and perfSONAR task forces
    • open discussion on the mandate - presentation is scheduled
  • 2014 WLCG Workshop in Barcelona (7-9 July):
    • Please register asap and book your hotel: https://indico.cern.ch/event/305362/ (registration will close one month in advance)
    • The agenda is being discussed and potential speakers will soon be contacted

News from EGI

  • Peter presents about the future support of ARGUS
    • SWITCH is currently supporting ARGUS on a best-effort basis, but strongly suggests that another institution take over in the medium term. Alternatives are being evaluated.
    • EGI surveying support for other products.

  • Peter explains that SWITCH has not given a firm deadline for the handover, but probably around 6 months, and it's important to start the discussion on finding alternatives.
  • Pepe asks when INFN can give an answer about taking over support: no timeline for the answer yet.
  • To be assessed again at the next meeting.

Proposal for new Working Group: Network and Transfer Metrics

  • Marian presents a proposal for a new Working Group: Network and Transfer Metrics
    • The proposed mandate is to identify and publish the relevant metrics, ensure that issues can be better understood and fixed, and enable the use of network-aware tools.
    • Objectives and membership are presented.

  • Pepe asks if the WG will implement alarming mechanisms for network and transfer issues. Shawn answers that the scope of the WG would be to provide the metrics needed for this and ensure that the data is well organized. Implementing the alarms is out of scope and there are other project proposals for this.
  • Nicolo and Simone ask if the working group will also handle operational aspects previously covered by the xrootd task force (e.g. deployment and upgrade of monitoring plugins), which are not mentioned in the presentation. Julia suggests that the WG should ensure that the infrastructure needed to collect the required metrics is in place. Shawn answers that while the WG's role is not to fix issues, it is in charge of identifying problems and coordinating with the developers to fix them, as mentioned explicitly for perfSONAR. It is agreed that checking that federation access metrics are published, and coordinating the deployment of the monitoring plugins, are also in scope for the WG.

Middleware news and baseline versions

https://twiki.cern.ch/twiki/bin/view/LCG/WLCGBaselineVersions

  • Security support for EMI-2 ended on April 30th; all baseline versions were raised to EMI-3, except for dCache, for which support was extended.
  • CVMFS bugfix release
  • gliteWMS bugfix release

Tier-0 and Tier-1 Grid services

Storage deployment

Status, recent changes and planned changes by site:

  • CERN
    • CASTOR: v2.1.14-11 and SRM-2.11-2 for ATLAS, ALICE, CMS and LHCb
    • EOS: ALICE (EOS 0.3.4 / xrootd 3.3.4); ATLAS (EOS 0.3.8 / xrootd 3.3.4 / BeStMan2-2.3.0); CMS (EOS 0.3.7 / xrootd 3.3.4 / BeStMan2-2.3.0); LHCb (EOS 0.3.3 / xrootd 3.3.4 / BeStMan2-2.3.0 (OSG pre-release))
  • ASGC: CASTOR 2.1.13-9, CASTOR SRM 2.11-2, DPM 1.8.7-3, xrootd 3.3.4-1
    • Recent changes: none; planned changes: none
  • BNL: dCache 2.6.18 (Chimera, Postgres 9.3 with hot backup); http (aria2c) and xrootd/Scalla on each pool
    • Recent changes: none; planned changes: none
  • CNAF: StoRM 1.11.3 emi3 (ATLAS, LHCb, CMS)
  • FNAL: dCache 2.2 (Chimera, Postgres 9) for the disk instance; dCache 1.9.5-23 (PNFS, Postgres 8 with backup, distributed SRM) for the tape instance; httpd 2.2.3; Scalla xrootd 3.3.6-1; EOS 0.3.21-1 / xrootd 3.3.6-1.slc5 with BeStMan 2.3.0.16
    • Planned changes: upgrade tape instance to Chimera/dCache 2.2 on May 19-20
  • IN2P3: dCache 2.6.18-1 (Chimera) on SL6 core servers and pool nodes; Postgres 9.2; xrootd 3.3.4 (ALICE T1), xrootd 3.3.4 (ALICE T2)
  • JINR-T1: dCache 2.6.24 (srm-cms.jinr-t1.ru); dCache 2.2.24 with Enstore (srm-cms-mss.jinr-t1.ru); xrootd federation host for CMS: 3.3.6
  • KISTI: xrootd v3.3.4 on SL6 for disk pools, ALICE T1 (redirector only; servers are still 3.2.6 on SL5, to be upgraded); xrootd 20100510-1509_dbg on SL6 for the tape pool; xrootd v3.2.6 on SL5 for disk pools (ALICE T2); DPM 1.8.7-4
  • KIT: dCache 2.6.21-1 (atlassrm-fzk.gridka.de), 2.6.17-1 (cmssrm-kit.gridka.de), 2.6.17-1 (lhcbsrm-kit.gridka.de); xrootd 20100510-1509_dbg (alice-tape-se.gridka.de), 3.2.6 (alice-disk-se.gridka.de), 3.3.3-1 (ATLAS FAX xrootd redirector)
  • NDGF: dCache 2.8.2 (Chimera) on core servers and pool nodes
    • Recent changes: upgraded to dCache 2.8.2
  • NL-T1: dCache 2.2.17 (Chimera) (SURFsara); DPM 1.8.7-3 (NIKHEF)
  • PIC: dCache head nodes (Chimera) and doors at 2.2.23-1; xrootd door to VO servers (3.3.4)
    • Recent changes: none; planned changes: dCache 2.9 tests ongoing in a test instance
  • RAL: CASTOR 2.1.13-9, 2.1.14-5 (tape servers), SRM 2.11-1
    • Planned changes: scheduling upgrade to CASTOR 2.1.14
  • RRC-KI-T1: dCache 2.2.24 + Enstore (ATLAS); dCache 2.6.22 (LHCb); xrootd: EOS 0.3.19 (ALICE)
  • TRIUMF: dCache 2.6.21
    • Recent changes: none; planned changes: none

  • PIC mentions that dCache 2.9 is compatible with Enstore.

FTS deployment

| Site | Version | Recent changes | Planned changes |
| CERN | 2.2.8 - transfer-fts-3.7.12-1 | | |
| ASGC | 2.2.8 - transfer-fts-3.7.12-1 | None | None |
| BNL | 2.2.8 - transfer-fts-3.7.10-1 | None | None |
| CNAF | 2.2.8 - transfer-fts-3.7.12-1 | | |
| FNAL | 2.2.8 - transfer-fts-3.7.12-1, fts-server-3.2.3-5 | Put FTS3 server into production | |
| IN2P3 | 2.2.8 - transfer-fts-3.7.12-1 | | |
| JINR-T1 | 2.2.8 - transfer-fts-3.7.12-1 | | |
| KIT | 2.2.8 - transfer-fts-3.7.12-1 | | |
| NDGF | 2.2.8 - transfer-fts-3.7.12-1 | | |
| NL-T1 | 2.2.8 - transfer-fts-3.7.12-1 | | |
| PIC | 2.2.8 - transfer-fts-3.7.12-1 | None | None |
| RAL | 2.2.8 - transfer-fts-3.7.12-1 | | |
| TRIUMF | 2.2.8 - transfer-fts-3.7.12-1 | | |

  • FNAL added FTS3 to the table; it will also be added for CERN and RAL. FTS3 will be tracked only for servers in the WLCG deployment.

LFC deployment

| Site | Version | OS, distribution | Backend | WLCG VOs | Upgrade plans |
| BNL | 1.8.3.1-1 (T1 and US T2s) | SL6, gLite | Oracle 11gR2 | ATLAS | None |
| CERN | 1.8.7-4 | SLC6, EPEL | Oracle 11 | ATLAS, OPS, ATLAS xrootd federations | |
| CERN | 1.8.7-4 | SLC6, EPEL | Oracle 12 | LHCb | |

Oracle deployment

  • Note: only Oracle instances with a direct impact on offline computing activities of LHC experiments are tracked here
  • Note: an explicit entry for specific instances is needed only during upgrades, listing affected services. Otherwise sites may list a single entry.

| Site | Instances | Current version | WLCG services | Upgrade plans |
| CERN | CMSR | 11.2.0.4 | CMS computing services | Done on Feb 27th |
| CERN | CASTOR Nameserver | 11.2.0.4 | CASTOR for LHC experiments | Done on Mar 4th |
| CERN | CASTOR Public | 11.2.0.4 | CASTOR for LHC experiments | Done on Mar 6th |
| CERN | CASTOR Alicestg, Atlasstg, Cmsstg, LHCbstg | 11.2.0.4 | CASTOR for LHC experiments | Done on Mar 10th, 14th and 25th |
| CERN | LCGR | 11.2.0.4 | All other grid services (including e.g. Dashboard, FTS) | Done on Mar 18th |
| CERN | LHCBR | 12.1.0.1 | LHCb LFC, LHCb DIRAC bookkeeping | Done on Mar 24th |
| CERN | ATLR, ADCR | 11.2.0.4 | ATLAS conditions, ATLAS computing services | Done on Apr 1st |
| CERN | HR DB | 11.2.0.4 | VOMRS | Done on Apr 14th |
| CERN | CMSONR_ADG | 11.2.0.4 | CMS conditions (through Frontier) | Done on May 7th |
| BNL | | 11.2.0.3 | ATLAS LFC, ATLAS conditions | TBA: upgrade to 11.2.0.4 (tentatively June) |
| RAL | | 11.2.0.3 | ATLAS conditions | TBA: upgrade to 11.2.0.4 (tentatively June) |
| IN2P3 | | 11.2.0.3 | ATLAS conditions | TBD: upgrade to 11.2.0.4 on May 13th |
| TRIUMF | TRAC | 11.2.0.4 | ATLAS conditions | Done |

  • Oracle upgrades completed at CERN. At Tier-1s, schedule depends on testing of Golden Gate (new replication method).

T0 news

  • First conclusions from the series of meetings on measuring batch job efficiency as a function of where jobs are executed: no indication of any significant difference in job efficiency between the CERN Geneva (Meyrin) and Wigner sites.
    • Some other possible causes of low job efficiency have been found and are being investigated:
      • Intel versus AMD (Some batch applications optimised for Intel CPUs)
      • Zombie pilot jobs
      • Some users transfer data from/to remote locations
      • Virtual vs Bare Metal
      • SLC5 versus SLC6
    • Other actions taken: dedicated perfSONAR measurements between Meyrin and Wigner, SLC6 with standard TCP parameters, and a full mesh of bandwidth measurements.
    • The meetings and investigations will continue; we'll report when there are more findings.
    • Full presentation to the WLCG mgt board available here: https://indico.cern.ch/event/302033/contribution/3/material/slides/0.pdf

  • WMS decommissioning: the machines are powered off, with no tickets or user complaints whatsoever. Only the SAM instances are still running in production.

  • Migration to VOMS-admin: waiting for the new voms-admin release, expected in around three weeks. Ticket open to the developers: GGUS:102984

  • Argus: Sporadic authentication failures have been observed in the site-argus service; apparently some internal timeouts were triggered. We increased the number of nodes in the site-argus alias, which seems to have improved the situation; the root cause is being investigated.
    • Third-level support is best effort and has an uncertain future. This is bad news, as Argus is an important piece of the grid middleware stack; we would like to raise it to WLCG's attention.

  • Maarten suggests to rephrase the statement about Meyrin vs Wigner job efficiency to clarify that no significant difference is seen on job efficiency based purely on the location.
  • Nicolo asks about the status of the SLC6 migration. Maite answers that the progress is currently 70% SLC6, +5% since the last report three weeks ago. At 80%, PES will discuss with experiments if/how much needs to be kept on SLC5.
  • Maite explains that the mails about zombie pilots are not yet sent automatically.

Other site news

Data management provider news

DPM 1.8.8 released

DPM 1.8.8 has been released to EPEL-stable. Highlights and full details available here: https://svnweb.cern.ch/trac/lcgdm/attachment/wiki/Dpm/DPM_releasenotes_Mar2014.txt

Storage Infosys publishing

A series of meetings involving many storage providers (Castor/EOS, dCache, DPM, StoRM) has been initiated as part of the validation of the information system. The aim is to ensure consistent, complete and correct publishing of storage systems to GLUE2, in particular relating to capacity publishing. https://twiki.cern.ch/twiki/bin/view/EGEE/GLUE2Storage
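
As an illustration of the kind of check involved, the sketch below queries a site BDII for the GLUE2 storage capacity objects. It is a hedged example only: the host name is a placeholder, and the object class and attribute names are taken from the GLUE2 LDAP schema, but which of them a given storage system actually fills in is exactly what the validation is meant to verify.

<verbatim>
# Hedged sketch (Python): list the GLUE2 storage capacities a site publishes,
# by shelling out to the standard ldapsearch CLI.
import subprocess

def glue2_storage_capacity(bdii_host, port=2170):
    """Return the raw LDIF describing a site's GLUE2 storage capacities."""
    cmd = [
        "ldapsearch", "-x", "-LLL",
        "-H", "ldap://%s:%d" % (bdii_host, port),
        "-b", "o=glue",
        "(objectClass=GLUE2StorageServiceCapacity)",
        "GLUE2StorageServiceCapacityType",
        "GLUE2StorageServiceCapacityTotalSize",
        "GLUE2StorageServiceCapacityUsedSize",
        "GLUE2StorageServiceCapacityFreeSize",
    ]
    return subprocess.check_output(cmd, text=True)

# Example with a placeholder host:
# print(glue2_storage_capacity("site-bdii.example.org"))
</verbatim>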

Experiments operations review and Plans

ALICE

  • High activity in preparation for Quark Matter 2014 (May 19-24, GSI Darmstadt)
  • KIT
    • smooth sailing for the last 2 weeks
  • CERN
    • SLC6 job efficiencies: next meeting tomorrow

ATLAS

  • MC production:
    • lower activity in the last two weeks, waiting for the new requests to be approved and submitted
    • multi-core MC14 production done, only smaller validation tasks running from time to time
  • FTS3 upgrade went smoothly without ATLAS intervention
  • Rucio stress test planned to start after 20th of May and continue with gradual increase of activity till end of June
  • Rucio commissioning: migration to the Rucio file catalog to be completed in the next two weeks
  • DQ2 site services issues: waiting for a DQ2 fix for uncaught exceptions affecting the stability of the service
  • Low job efficiency at CERN:
    • the pilot wrapper now implements a cleanup of orphan processes via process-group control
    • in production, the problematic tasks at CERN were activated again
  • Analysis and parallel make:
    • sites and users complained about build-job failures: asetup automatically sets MAKEFLAGS to the number of available cores on the host
    • pilot wrapper fix: after the setup, MAKEFLAGS is reset to the number of cores allocated to the batch job (see the sketch after this list)
  • Multi-core production:
    • no massive tasks planned till release 19 is ready for simulation and reconstruction
    • sites asked to reduce the multi-core partition in case of static single-core/multi-core allocation
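
A minimal sketch of the pilot-wrapper fix described above; the function name is illustrative, not the actual ATLAS pilot code. The idea: after the release setup has run, MAKEFLAGS is forced back to the number of cores actually allocated to the batch job, so parallel builds cannot oversubscribe shared worker nodes.

<verbatim>
# Illustrative sketch (Python) of the MAKEFLAGS reset described above.
import os

def reset_makeflags(allocated_cores):
    """Cap parallel make at the cores allocated to this batch job."""
    # asetup may have set MAKEFLAGS to the number of *available* cores on
    # the host (e.g. -j32 on a 32-core node), oversubscribing a 1-core slot.
    os.environ["MAKEFLAGS"] = "-j%d" % allocated_cores

reset_makeflags(1)              # a single-core batch slot
print(os.environ["MAKEFLAGS"])  # -> -j1
</verbatim>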

  • Andrej clarifies that ATLAS will continue to send a mixture of single-core and multi-core jobs in the future.

CMS

  • High priority Production and Processings
    • Heavy Ion reprocessing (now finished)
    • Heavy Ion MC
    • Upgrade MC
  • Oracle upgrade of CMS Online DB went ok yesterday
    • A CMS-internal communication deficit suggested that all Grid jobs would be impacted (blame CW for this!)
    • CERN DBAs and CMS Frontier experts found a way to perform the upgrade transparently (Thanks!)
  • SAM tests
    • The CMS SAM test for gLExec will be made critical on May 15th
      • Tickets will be opened by the end of this week to sites that would fail
  • SAM test for xrootd fallback
    • Not yet critical
    • Still waiting (mainly) for RAL to fix some issues
  • Xrootd Federation - “AAA”
    • Scale testing of Tier-1s ongoing
    • Reminder to sites: Please deploy detailed xrootd monitoring
  • Multi core processing
    • Started to send production workflows through a mixture of multi-core and single-core pilots
      • Executing N single-threaded jobs in N-core pilot (N typically 8)
      • First successful experiences at PIC (where we started first)
      • Ramping up at KIT, RAL and JINR
    • Functional tests at other sites continuing or about to start
  • FTS3 for PhEDEx Debug transfers becoming mandatory now
    • Will send tickets to sites this week
  • Problem in OSG DigiCert CRL on May 1st
    • SAM tests failed for US sites
    • Site readiness metric already corrected
    • Due to an issue on the CMS SAM-Nagios box, the changes were only picked up properly on Monday

  • On Andrea's question, Marian comments that a ticket for SAM availability recalculation for the CRL incident is already open.
  • Simone asks other VOs about their strategy for availability recalculation in case of failures or timeouts in submission of SAM jobs through gliteWMS, which do not necessarily affect production jobs. Andrea answers for CMS that the recalculation is requested if the problem is in the infrastructure, but not if it is at the site; it needs to be investigated case by case. Such failure modes are expected to be reduced with the new CondorG and CREAMCE SAM submission probes.
  • Julia comments that the new SAM will allow each VO to include production efficiency in the availability calculation if desired, e.g. by taking the best of the production and SAM job efficiencies as the site availability (a minimal sketch follows). SAM jobs could still be used to probe individual services, without penalizing the site if production is OK.
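
A minimal sketch of the combination rule Julia mentions, assuming it is one possible option in the new SAM rather than an agreed algorithm: the site availability is the best of the SAM result and the production job efficiency, so a site with healthy production is not penalized by SAM-only failures.

<verbatim>
# Sketch (Python): "best of production and SAM job efficiency".
def site_availability(sam_efficiency, production_efficiency):
    """Both inputs in [0, 1]; healthy production shields SAM-only failures."""
    return max(sam_efficiency, production_efficiency)

# SAM jobs failing (e.g. submission timeouts) while production is fine:
print(site_availability(0.0, 0.95))  # -> 0.95
</verbatim>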

LHCb

  • Incremental stripping campaign finished, all productions closed, many thanks to all Tier1 sites for their support
  • CASTOR->EOS migration of LHCb user data finished, all CERN permanent storage for LHCb distributed computing now allocated at EOS, many thanks to DSS for the migration
  • Problem with some certificates especially for Brazilian VO members to access data at GRIDKA and IN2P3
    • a similar problem occurred last fall at other dCache sites (special characters in certificates) and was fixed by an update; this time the cause seems to be different
    • investigations ongoing

  • Maarten comments that the certificate issues are probably in some library used by dCache.
  • On Pepe's question, Stefan answers that PIC is not affected this time unlike the last problem.

Ongoing Task Forces and Working Groups

Tracking tools evolution TF

  • GGUS: proposal to stop ticket creation through email (more info). At the moment this feature can cause the creation of a lot of fake tickets, and not that many real tickets are opened through email.
  • CMS Comp Ops started the transition from Savannah to GGUS. At some point the GGUS-to-Savannah bridge should be stopped; the current plan is to stop it at the end of June (more details).

FTS3 Deployment TF

gLExec deployment TF

  • Christoph comments that very few CMS sites are still missing gLExec deployment (a couple of Tier-2s and some Tier-3s).

Machine/Job Features

  • Main activity on the development, and soon deployment, of a machine/job features service for cloud infrastructures
    • the "API" was agreed at the level of the URL / GET parameters, for consistency. This allows
      • different implementations of the IaaS service
      • clients to access the values directly instead of using the mjf.py client (if wanted), as only two modes (batch/IaaS) need to be implemented; see the sketch below
  • Note for sites: the mjf.py client does not need to be deployed as far as the LHC VOs are concerned, as all of them will bring the client with them in their software stacks (CVMFS etc.)
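
A minimal sketch of the direct-access mode mentioned above, assuming the convention that $MACHINEFEATURES and $JOBFEATURES point either to a local directory with one file per key (batch mode) or to a base URL answering plain GET requests (IaaS mode); the key name used in the example is illustrative.

<verbatim>
# Sketch (Python) of a direct machine/job features client (not mjf.py).
import os
from urllib.request import urlopen

def read_feature(base_var, key):
    """Return the value of one key, or None if the variable is unset."""
    base = os.environ.get(base_var)
    if base is None:
        return None
    if base.startswith(("http://", "https://")):  # IaaS mode: plain GET
        return urlopen(base.rstrip("/") + "/" + key).read().decode().strip()
    with open(os.path.join(base, key)) as f:      # batch mode: plain files
        return f.read().strip()

# Example with an illustrative key name:
print(read_feature("MACHINEFEATURES", "jobslots"))
</verbatim>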

  • Stefan comments that sites are welcome to volunteer for testing the service for a cloud infrastructure.
  • Alessandra reminds everyone that not all VOs are shipping the mjf client yet, but they are planning to do so. It is available in the SFT CVMFS repository.

Middleware readiness WG

  • Next meeting on Thu May 15, 10:30-12:00 CEST
  • As promised at the WLCG Planning meeting of 2014/04/17, we addressed the following questions to the Tier-0 and Tier-1 contacts, and received very prompt and useful responses that will be on the MW Readiness WG twiki in time for next week's meeting. Many thanks to everyone!
    1. If, how and where you publish the MW versions you run in production.
    2. How you use the baseline versions table, given that the "baseline version" number doesn't necessarily reflect all individual updates of the packages in the dependencies.

Multicore deployment

  • CMS started scale tests of multicore pilots (running N single-threaded jobs inside an N-core pilot, as sketched after this list) at the Tier-1s: PIC, KIT, RAL, JINR and CCIN2P3.
  • The TF now enters a second stage, in which we will evaluate the compatibility of the ATLAS and CMS approaches to submitting multicore jobs to shared sites.
  • Coming sessions will be dedicated to presenting and discussing this experience from the sites' point of view.
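
An illustrative sketch of the scheme being scale-tested, not the actual CMS pilot code: an N-core pilot starts N single-threaded payloads side by side and waits for all of them to finish.

<verbatim>
# Sketch (Python): N single-threaded jobs inside an N-core pilot.
import subprocess

def run_pilot(n_cores=8, payload=("sleep", "1")):
    """Launch one single-threaded payload per allocated core, wait for all."""
    procs = [subprocess.Popen(payload) for _ in range(n_cores)]
    return [p.wait() for p in procs]

print(run_pilot())  # e.g. [0, 0, 0, 0, 0, 0, 0, 0] for N = 8
</verbatim>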

  • Simone asks if the TF has provided recommendations for all batch systems for dynamic resource provisioning. Alessandra and Antonio answer that they will be provided after gathering more experience with simultaneous ATLAS and CMS running. Simone offers dedicated ATLAS test tasks for this.

SHA-2 Migration TF

  • EGI broadcast #2 about the new VOMS servers was sent on May 6
  • a problem with the timeline was discovered on May 7:
    • job submission to CREAM fails when the proxy was signed by a VOMS server with a SHA-512 host certificate (GGUS:104768); a way to check a server certificate's signature algorithm is sketched after this list
    • our new VOMS servers have such certificates
    • the fix has been tested successfully
    • we now need the fix to become available in EMI and UMD repositories
    • all sites then need to update their CEs
    • June 2 looks a bit tight...
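
A small helper sketch for the check implied above; it is a hypothetical example, not an official tool: fetch a server's host certificate and report its signature algorithm, e.g. to see whether a VOMS server's certificate uses SHA-512.

<verbatim>
# Sketch (Python): report the signature algorithm of a server certificate.
import ssl
import subprocess

def signature_algorithm(host, port):
    """Fetch the server certificate and extract its signature algorithm."""
    pem = ssl.get_server_certificate((host, port))
    text = subprocess.run(["openssl", "x509", "-noout", "-text"],
                          input=pem, capture_output=True,
                          text=True, check=True).stdout
    for line in text.splitlines():
        if "Signature Algorithm" in line:
            return line.split(":", 1)[1].strip()  # e.g. sha512WithRSAEncryption

# Example with a placeholder endpoint:
# print(signature_algorithm("voms.example.org", 8443))
</verbatim>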

  • Simone asks if there is also a plan to migrate to RFC proxies, now of interest for ATLAS, and if an update can be provided at the next meeting. Maarten answers that it shouldn't be an issue, as CMS has already started using them; he suggests to switch the SAM preprod instances to RFC proxies.
  • Nicolo confirms that CMS has found no blocking issues with RFC proxies, though some sites needed to upgrade the services (e.g. BeStMan) to recent versions.

WMS decommissioning TF

  • CERN WMS instances for experiments have been switched off on May 5
  • SAM instances have their own timeline

IPv6 validation and deployment TF

  • Today there is a meeting of the HEPiX IPv6 working group, focused on the preparation of the June pre-GDB
  • Quoting today's email from Edoardo Martelli:
    • It's with great pleasure that I can introduce you to lxplus-ipv6.cern.ch, an lxplus instance with dual-stack connectivity. You can ssh to it over IPv6 or IPv4 from anywhere on the Internet. I'd like to thank Steve Traylen, Ignacio Reguero and the IT-PES group for making it possible.

HTTP proxy discovery TF

  • For reference: TF home page
  • Progress has been slow, but some has happened:
    • It is mostly waiting on full implementation of the SquidMonitoringTaskForce recommendations. For that, the GOCDB & OIM fields for registering squids have been defined, and the wlcg-squid-monitor.cern.ch machine is reading them and putting the list into a JSON file. However, the MRTG monitor that reads the file isn't complete yet, so it's too early to ask all sites to register their squids.
    • The wlcg-wpad.cern.ch name has been defined as an alias to wlcg-squid-monitor.cern.ch for now, but wpad.dat is not yet being generated from the JSON file (a sketch of such a generator follows this list).
    • The frontier client has been fully ready to read WPAD/PAC files for a year now. The cvmfs 2.1.19 client implementation is almost complete, except that it doesn't robustly support round-robin over all the http proxies in a list, which is the only way to do load balancing for proxies specified in PAC files; otherwise the listed proxies are tried sequentially.
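
A hedged sketch of the missing step, generating wpad.dat from the JSON squid list; the JSON layout and file names are assumptions for illustration, not the actual format used by wlcg-squid-monitor.cern.ch. As noted above, PAC entries are tried sequentially, so load balancing is up to the client.

<verbatim>
# Sketch (Python): turn a JSON list of registered squids into a wpad.dat
# PAC file. Assumed input format: [{"host": "...", "port": 3128}, ...].
import json

PAC_TEMPLATE = '''function FindProxyForURL(url, host) {
    return "%s";
}
'''

def make_wpad(json_path="squids.json", pac_path="wpad.dat"):
    with open(json_path) as f:
        squids = json.load(f)
    # PAC semantics: entries are tried sequentially, so load balancing is
    # up to the client (round-robin), as noted above for cvmfs 2.1.19.
    proxies = "; ".join("PROXY %s:%d" % (s["host"], s["port"]) for s in squids)
    with open(pac_path, "w") as f:
        f.write(PAC_TEMPLATE % proxies)

# make_wpad()  # writes wpad.dat next to squids.json
</verbatim>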

Action list

  1. Document procedure and forum to track networking issues
    • In the mandate of the Network and Transfer Metrics WG

AOB

-- NicoloMagini - 05 May 2014
