WLCG Operations Coordination Minutes, January 21st 2016

Highlights

  • Sites and experiments are reminded that CC7, SL7 and CentOS7 are compatible distributions and that any of them can be used.
  • The AFS team at CERN would like feedback from experiments on whether the AFS outage OTG:0027970 on 18-19.01 affected any of their critical workflows.
  • ETF Nagios will move to RFC proxies on 01.03.2016. Validation is currently ongoing.
  • CMS sites are requested to move to PhEDEx 4.1.5 (minimum version) or to 4.1.7 (recommended version) on SL6.
  • The ATLAS Sites Jamboree takes place from Wednesday 27th to Friday 29th January. Sites should register if they plan to attend.

Agenda

Attendance

  • local: Maria Alandes (minutes), Maarten Litmaath (ALICE), Jerome Belleman (T0), Alessandro Di Girolamo (ATLAS), Oliver Keeble, Andrea Manzi, Marian Babik, Raja Nandakumar (LHCb)
  • remote: Alessandra Forti (chair), Michael Ernst, Andreas Petzold, Antonio Maria Perez Yzquierdo, Aresh, Christoph Wissing, Di Qing, David Mason, Felix Lee, Jeremy Coles, Pepe Flix, Stefano Belforte, Catherine Biscarat, Vincenzo Spinoso, Frederique Chollet
  • apologies: Maria Dimou (MWR WG)

Operations News

  • GGUS support is returning to the care of Maria Dimou.

Middleware News

  • Useful Links:
  • Baselines/New releases:
  • Issues:
    • The globus-gssapi change for hostname verification, a new behaviour that has been under discussion for a long time, is going to be released on April 1st.
    • openldap crash in TopBDII and ARC-CE resource BDII. RedHat is going to release the fix in RHEL 6.8 (to be scheduled). The openldap version causing the issue is now also available in CentOS7, so the same issue is likely to appear there. We have therefore asked for the fix to be included in the next version of RHEL 7 as well.
    • GGUS:118842, gfal-cat fails with Castor, issue discovered when using the ATLAS tool which collects storage dumps
  • T0 and T1 services
    • ASGC
      • CASTOR decommissioned
    • CNAF
      • Installed the latest production version of the storm-webdav service on the LHCb gridftp servers. Updated the SRM server certificates to include the alternative names used to contact them
    • CERN,RAL and BNL
      • The DB change suggested by the developers to fix a problem with Rucio polling has been applied
    • NDGF
      • dCache upgraded to v 2.14.8

Alessandra asks whether the situation with CC7, SL7 and CentOS7 is clear: they are different distributions, and sites may be unsure which OS to install. Maarten explains that it makes no difference, as the distributions are compatible. This was already the case with SL, where the Fermilab distribution and the CERN one were not exactly the same. Maarten adds that a package is very unlikely to be built with an OS dependency that is available in only one of the distributions. If that happens, it will be discovered and the dependency will have to be declared explicitly, so it is not a big problem. Andrea reminds that MW verification is always done on CentOS7. Maarten explains that the same happened in the past, when verification was done on SLC: some inconsistencies between SL and SLC were found at the time and corrected.

Tier 0 News

NTR.

DB News

Tier 1 Feedback

  • NDGF-T1: ALICE disk storage is really full, causing problems. (Ulf can't attend, I'm in a meeting at CERN at the same time)

Tier 2 Feedback

Experiments Reports

ALICE

  • mostly high activity
  • disk space
    • regular and ad-hoc cleanups ongoing
    • policy changes under discussion with the physics groups
    • CASTOR: the old disk servers remain available until April - thanks!

ATLAS

  • Activities have been running smoothly during the past 2 weeks, stable at around 200k running slots.
  • An issue was discovered in the reprocessing data produced over the Christmas break. It has now been fixed, and the new tasks will most probably be submitted at the end of this week. Data from the last round will most probably be deleted starting tomorrow (waiting for the green light from DP).
  • At the last WLCG Ops Coord meeting (https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOpsMinutes160107#ATLAS) we reported an issue with MadGraph jobs producing huge log files that caused trouble on WNs. This was not the only problem they caused: they also created a large amount of dark data, >1PB (i.e. data on storage but not recorded in Rucio), because the pilot was taking hours to tar the output logs and was not sending updates to the pandaserver, which then considered the job dead. This will be fixed in the next pilot release.
  • To all ATLAS sites: the ATLAS Sites Jamboree takes place Wed-Fri 27-29 January, https://indico.cern.ch/event/440821/ . Please register if you plan to attend.
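
The definition of dark data given above (files on storage that are not recorded in the catalogue) amounts to a set difference. The following is only an illustrative sketch of that idea, not the tooling ATLAS or Rucio actually use; the function name and the example paths are hypothetical.

```python
def find_dark_data(storage_dump, catalogue_entries):
    """Return paths present in the storage dump but absent from the catalogue.

    Both arguments are iterables of file paths (hypothetical inputs, e.g. one
    read from a site storage dump and one exported from the catalogue).
    """
    return sorted(set(storage_dump) - set(catalogue_entries))


if __name__ == "__main__":
    # Made-up example paths, for illustration only.
    on_storage = ["/data/a.log.tgz", "/data/b.root", "/data/c.root"]
    in_catalogue = ["/data/b.root", "/data/c.root"]
    print(find_dark_data(on_storage, in_catalogue))
```

In practice a comparison like this would also have to allow for files that are still being written or registered when the dump is taken.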

CMS

  • Tier0/PromptRECO
    • CMS took much more Heavy Ion data last year than PromptRECO capacity would allow
    • Had a rather long backlog of "Tier-0/PromptRECO" jobs
    • Backlog gone since early this week
  • Continue to have high to very high processing and production load
    • More than 100k parallel production jobs most of the time
    • Reprocessing of 2015 data progressing well
  • Requests for sites
    • All sites are requested to move to PhEDEx 4.1.5 (minimum version) or to 4.1.7 (recommended version) on SL6. Please note that PhEDEx version 4.1.6 does not exist.
  • Operational issues
    • Kerberos/AFS problem at CERN affected users and various CMS services. Alarm ticket: GGUS:118938
    • Some Kibana based monitoring from CERN-IT has been (is being) fixed

Maarten reminds that at the 3PM meeting today, the AFS service manager asked for feedback on critical workflows affected by the AFS outage.

LHCb

  • Activities :
    • Mostly user and MC jobs running on the grid
    • Started pre-staging for turbo-calibration.
  • Information
    • Restripping imminent - use pre-staged data from
    • Thanks to CNAF for enabling MJF. Look forward to other sites also enabling it.
    • Other site issues handled either by GGUS tickets or internally.
      • SARA srm problems

Ongoing Task Forces and Working Groups

gLExec Deployment TF

  • NTR

Machine/Job Features TF

HTTP Deployment TF

  • All relevant sites have now been ticketed (around 40)
  • Meeting on 20th Jan was canceled
  • TF will continue to iterate on the remaining tickets

Information System Evolution


  • Preparation and discussion of the slides to be presented at the WLCG workshop.
  • New Execution Environment service in GOCDB/OIM to give logical CPUs and Benchmark information of the resources in a site:
    • Discussion with GOCDB developer to understand whether a new Execution Environment service could be added to GOCDB. The answer is yes but there is no writeable REST API for the time being. Feedback being collected from sys admins to understand advantages and disadvantages of having this new service defined in GOCDB.
    • OSG is already providing part of the needed information (Benchmark). They are planning to add the HS06 normalisation constant so that the number of logical CPUs can be derived from it (logical cores = total HS06 / HS06 normalisation).
  • After the WLCG workshop we hope to have clearer directions on the next steps inside the TF, especially for the new IS, which is on hold for the time being.
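
The derivation of logical cores from the HS06 figures mentioned above can be sketched as follows. This is a minimal illustration, assuming the normalisation constant is expressed in HS06 per logical core; the function name and the example numbers are hypothetical and are not taken from OSG or GOCDB.

```python
def logical_cores(total_hs06: float, hs06_normalisation: float) -> int:
    """Estimate logical cores as total HS06 / HS06 normalisation constant.

    Assumes hs06_normalisation is the HS06 power of one logical core.
    """
    if hs06_normalisation <= 0:
        raise ValueError("HS06 normalisation constant must be positive")
    return round(total_hs06 / hs06_normalisation)


if __name__ == "__main__":
    # Hypothetical site: 12000 HS06 in total, 10 HS06 per logical core.
    print(logical_cores(12000, 10))
```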

Alessandro reminds that the IS TF should align with any definitions made in other TFs such as MJF. Maria reminds that Andrew McNab acts as the link between the two TFs and brings into the discussion any relevant information that also affects MJF. Moreover, a joint session between IS, Accounting and Benchmarking will take place at the WLCG workshop to discuss common issues.

IPv6 Validation and Deployment TF


Middleware Readiness WG


The JIRA dashboard shows per experiment and per site the product versions pending for Readiness verification. Changes since the Ops Coord. meeting of Jan. 7th are:

Multicore Deployment

Network and Transfer Metrics WG


  • WLCG Network Throughput SU: GGUS-118730 Throughput degradation between CA and EU
    • The root cause was instability of the transatlantic link (WIX reported a submarine shunt fault), which in turn impacted the GEANT-CANARIE link.
    • The perfSONAR network helped to identify the problematic segment, and once CANARIE was notified the issue was resolved by re-routing.
    • The issue was reported by ATLAS, but many different parties were involved (ATLAS, TRIUMF, perfSONAR support, LHCONE, CANARIE, WIX).
    • Multiple GGUS tickets were opened, but only one was followed up. This is something to improve in the future.
    • Experiments: Please check if everyone was notified of the on-going incident and let us know if we need to add additional contacts (wlcg-network-throughput mailing list)
  • OSG perfSONAR production services: Storage failure (OASIS) at GOC has impacted the entire perfSONAR pipeline, initially just the datastore, but later on also collector and publisher. The issue was resolved yesterday and the systems are recovering now. We have proposed changes that would remove dependency on the shared storage.

RFC proxies

  • SAM
    • new ETF Nagios preprod hosts are using RFC proxies for ALICE, ATLAS and LHCb
    • also agreed for CMS
    • comparisons with production still to be done, but mostly to check other changes
    • tentative date for production: March 1

Alessandra asks whether there is any objection to the proposed date. Maarten explains that all experiment contacts for SAM have been informed and that it looks OK so far. Marian explains that validation is still ongoing and that he could give a short presentation at the next meeting with more details. In any case, these changes should be transparent.

Squid Monitoring and HTTP Proxy Discovery TFs

  • Alastair Dewhurst has finished the implementation of a flexible exception list for squids to monitor and just needs to make it available for ATLAS & CMS to use
  • Vassil Verguilov will next fill in exceptions known to CMS and generate a CMS-specific MRTG monitoring page using the CMS Sitedb to translate from the names in GOCDB & OIM into the TN_CC_Site format that CMS uses to name sites.

Action list

| Creation date | Description | Responsible | Status | Comments |
| 2015-06-04 | Status of fix for Globus library (globus-gssapi-gsi-11.16-1) released in EPEL testing | Maarten | ONGOING | Host certificates need to be fixed for any service in WLCG that does not yet work OK with the new algorithm. Otherwise we will get hit when the change finally comes in Globus 6.1, now foreseen for April 1 (sic). On Jan 21 there are only 2 tickets still open: GGUS:117043 for CNAF (largely done) and GGUS:118371 for FNAL (in progress). Maarten will follow-up the progress of these tickets. |
| 2015-12-17 | Recommend site configurations to enforce memory limits on jobs | | ONGOING | 1) create a twiki, 2) ask T0/1 sites and possibly others to describe their configurations, 3) derive recommendations for each batch system. Comment from the 2016-01-07 Ops Coord meeting: The existing twiki BSPassingParameters will be enhanced to contain the recommended memory limit values per batch system, as requested by the MB. The details will be discussed offline between Marias A. & D, Maarten and Alessandra F. Status of Jan 12th: A new twiki BatchSystemsConfig was finally decided as a better idea. Tickets opened. |

Specific actions for experiments

| Creation date | Description | Affected VO | Affected TF | Comments | Deadline | Completion |
| 22.01.2016 | Provide feedback to AFS service managers at CERN on whether the AFS outage OTG:0027970 that happened on 18-19.01 affected any of their critical workflows | All | - | AFS team at CERN is reducing the dependencies and usage of AFS and is collecting existing use cases that are critical for experiments. The outage is a good opportunity to discover unknown use cases | - | ONGOING |

Specific actions for sites

| Creation date | Description | Affected VO | Affected TF | Comments | Deadline | Completion |
| 22.01.2016 | CMS sites are requested to move to Phedex 4.1.5 (minimum) or 4.1.7 (recommended) on SL6 | CMS | - | - | - | ONGOING |

AOB

-- MariaALANDESPRADILLO - 2016-01-19

Topic revision: r20 - 2018-02-28 - MaartenLitmaath