WLCG Operations Coordination Minutes, September 3rd 2015

Highlights

  • CMS requested that the Xrootd version for the AAA Redirectors be added to the baselines. The requirements are being clarified
  • The slapd process sometimes crashes after upgrading to SLC6.7/CentOS 6.7, which includes a new version of openLDAP (openldap-servers-2.4.40-5). WLCG advises sites not to upgrade to this version of openLDAP
  • A dCache vulnerability (https://wiki.egi.eu/wiki/SVG:Advisory-SVG-2015-9323) affects the gsi and kerberos ftp doors. All versions of dCache prior to 2.13.7, 2.12.19, 2.11.30 and 2.10.39 are affected. The baselines for dCache have been updated and sites are advised to upgrade to the latest versions containing the fix
  • EGI will move to RFC proxies as of October or November
  • The OSG perfSONAR dashboard (http://psmad.grid.iu.edu), which is connected to the OSG datastore, is already showing up-to-date content

Agenda

Attendance

  • local: Maria Alandes (Minutes), Maarten Litmaath, Andrea Sciaba, Andrea Manzi, Marian Babik, David Cameron, Giuseppe Lo Presti
  • remote: Alessandra Forti (Chair), Antonio Maria Perez Calero Yzquierdo, Christoph Wissing, Frederique Chollet, Felix Lee, Maite Barroso, Michael Ernst, Peter Gronbech, Renaud Vernet, Ulf Tigerstedt, Rob Quick, Thomas Hartmann, Alessandro Cavalli, Vincenzo Spinoso, Pepe Flix, Alessandra Doria
  • apologies: Stefan Roiser

Operations News

None

Middleware News

  • Baselines:
    • CMS requested that the Xrootd version for the AAA Redirectors be added to the baselines. What are the requirements for CMS? What about ATLAS for FAX?
  • Issues:
    • On nodes running BDII, both CERN and Nikhef reported the slapd process crashing after upgrading to SLC6.7/CentOS 6.7, which includes a new version of openLDAP (openldap-servers-2.4.40-5). Dennis from Nikhef has opened a ticket with RedHat containing the crash details. While the issue is under investigation, we advise sites not to upgrade to this version of openLDAP.
    • dCache vulnerability broadcast by EGI SVG (https://wiki.egi.eu/wiki/SVG:Advisory-SVG-2015-9323). It affects the gsi and kerberos ftp doors. All versions of dCache prior to 2.13.7, 2.12.19, 2.11.30 and 2.10.39 are affected. The baselines for dCache have been updated and sites are advised to upgrade to the latest versions containing the fix.
    • The message for sites regarding the globus-gssapi change has been prepared together with Maarten (https://twiki.cern.ch/twiki/bin/view/LCG/WLCGBaselineVersions#Globus_GSSAPI_Name_compatibility). It will be broadcast by EGI first, in a few days. We are also trying to prepare a script to monitor the affected services.
  • T0 and T1 services
    • IN2P3 and JINR upgraded to dCache 2.10.39 because of the reported vulnerability, and for the same reason NDGF upgraded to 2.13.7
    • RAL updated the disk servers in disk-only service classes to SL6 and plans to do the same for disk servers in tape-backed service classes.
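
The version thresholds quoted for the dCache vulnerability can be checked mechanically. The following is a minimal sketch, not an official tool: the function name is_affected is hypothetical, the version string is assumed to have the form major.minor.patch (optionally followed by a "-release" suffix), and series newer than 2.13 are assumed to already contain the fix, which the advisory text above does not state explicitly.

```python
# Minimum fixed patch release per dCache series, taken from the
# version list quoted in the minutes (2.13.7, 2.12.19, 2.11.30, 2.10.39).
FIXED_PATCH = {(2, 13): 7, (2, 12): 19, (2, 11): 30, (2, 10): 39}


def is_affected(version: str) -> bool:
    """Return True if the given dCache version predates the fixed release.

    Assumes a 'major.minor.patch' string, optionally with a '-release'
    suffix (for example '2.10.38-1').
    """
    numeric = version.split("-")[0]
    major, minor, patch = (int(part) for part in numeric.split(".")[:3])
    series = (major, minor)
    if series in FIXED_PATCH:
        return patch < FIXED_PATCH[series]
    # Series older than 2.10 predate every fixed release and are affected.
    # Series newer than 2.13 are ASSUMED to include the fix.
    return series < min(FIXED_PATCH)


if __name__ == "__main__":
    for candidate in ("2.10.38", "2.10.39", "2.13.7"):
        print(candidate, "affected" if is_affected(candidate) else "not affected")
```

A site could feed the installed dCache version string into such a check before deciding whether an upgrade is needed.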

Tier 0 News

None

Tier 1 Feedback

None

Tier 2 Feedback

None

Experiments Reports

ALICE

  • high activity
  • CERN
    • continued intermittent issues with accessing CASTOR for reading or writing raw data files
      • the CASTOR team applied mitigations on various occasions, thanks!
      • a more robust solution is eagerly awaited...
    • CVMFS Stratum-0 interface machine down for a few hours last Fri (INC:0846886)
      • fixed in time for the weekend, thanks!

Christoph Wissing reports that CMS also experiences problems between Point 5 and Wigner and wonders whether they could be the same as those reported by ALICE. Maarten explains that the ALICE problems are between the pit, in particular the DAQ system, and the Computing Center in Meyrin, since ALICE writes directly to CASTOR, so he is not sure the problems are the same. Giuseppe explains that CMS writes to EOS first, which is why its traffic goes through Wigner, whereas all CASTOR servers are in Meyrin. Andrea Manzi refers to link congestion reported in the SSB, affecting CMS EOS transfers.

ATLAS

  • Usual high activity, especially in analysis last week in the run-up to the LHCP conference (70k slots)
  • T2s were requested to change the analysis share from 50% to 25% since ATLAS runs centralised derivation production for analysis
  • An issue with rfc/non-rfc proxies on pilot factories caused an explosion in logs on CREAM CEs

Maarten explains that it is known that broken proxies create a verbose log file although this is a very unusual condition. It is good for the sites to be aware of this particular situation.

It is agreed to add an action item for the change of the analysis share.

CMS

  • Rather high activity for the past week
    • Reached ~120k parallel jobs in the CMS Global Pool
  • Multi-Core accounting
    • Production portal does not support it (as of today)
    • Data from the development portal partly usable
  • Followup of Tier-0 transfer issues to Wigner
    • Investigated by IT experts
    • Some router hardware being exchanged (in chunks)

It is agreed to put an action item on the Multi-core accounting portal issue, since this has been pending now for a long time. John Gordon will be contacted to schedule a presentation at the next Ops Coord meeting to understand the status of the multi-core accounting portal and the plans to have this ready as soon as possible.

LHCb

  • Data Processing
    • Validation and data quality verification of 25 ns data finishing, production processing likely to start soon. All data is buffered on disk resident areas -> no staging
  • Operations
    • CNAF outage affecting LHCb storages
    • Ongoing discussion with IT/PES about worker nodes which are executing payloads significantly slower (e.g. GGUS:116023)
  • Development
    • interface for HTCondor submission ongoing
  • Questions
    • Is it possible to do a live migration of a vobox VM to another hypervisor instead of shutting it down?
    • response from the CERN cloud team: Live-migration is possible, with constraints. For example, it is limited to VMs w/o attached volumes, and only admins can execute it. In general: please contact us in cases where our scheduled interventions cause problems, so we can discuss how to accommodate you.

Maite will follow up on the vobox migration to a new VM. It is mentioned that this was already addressed at the 3PM meeting and that Stefan agreed to open a ticket for this action in any case.

Christoph asks whether the slow worker node issue could be related to the issues experienced by ATLAS or CMS. Maarten recalls that this is all part of the investigations being done by the Job Efficiency task force, which is meeting the next day, and suggests continuing the discussion there.

Ongoing Task Forces and Working Groups

gLExec Deployment TF

  • gLExec in PanDA
    • the HC tests cover all sites that currently have gLExec (61 out of 94)
    • the latest PanDA pilot released on Mon fixes an issue experienced at BNL
    • more news when Eddie is back

HTTP Deployment TF

Information System Evolution


  • REBUS known issues have been either fixed or are in the to do list of REBUS maintainers.
  • Many action items are on hold until the Information System use cases are presented at the MB
  • Draft document describing use cases should be ready on Monday 7th September. It will be presented at the MB on 15th September
  • Update on Information System Status also scheduled at next GDB on 9th September

IPv6 Validation and Deployment TF


Machine/Job Features

  • A SAM probe checking for the existence of MJF is in pre-prod for LHCb. It shows that 4 sites currently have MJF installed.
  • Presentation about status of TF in GDB next week

Middleware Readiness WG


  • Not much to report about verifications because few MW versions were made available during the summer
  • Development of the MW Readiness App continued in the past weeks, with graphical enhancements to the site, SSB integration and access to host packages. The dev version is at https://mw-readiness-dev.cern.ch/ (CERN only)
  • We would like to remind volunteer sites that the deadline to move to the new pakit-client + server conf has passed (end of August) and some upgrades are still missing. See instructions sent by mail in July at:
  • The agenda of our next meeting (16/9 at 4pm CEST) is at http://indico.cern.ch/e/MW-Readiness_12. Please send additional items to the e-group wlcg-ops-coord-wg-middleware at cern...
  • An update on the WG activities will be presented at next week's GDB.

Multicore Deployment

Network and Transfer Metrics WG


  • Meeting held yesterday, 2nd of September https://indico.cern.ch/event/393102/
  • Today OSG enabled publishing of the perfSONAR results from the ITB collector service to netmon-test-mb.cern.ch. The production setup is still pending the SLA.
  • The OSG perfSONAR dashboard (psmad.grid.iu.edu), which is connected to the OSG datastore, is already showing up-to-date content.
  • MadAlert, a new project to analyse meshes and distinguish infrastructure issues from network problems, is already reporting from psmad (http://maddash.aglt2.org/madalert.html).
  • perfSONAR operations status
    • Latency mesh: 81 sonars (94% efficiency)
    • Traceroute mesh: 112 sonars (90% efficiency)
    • perfSONAR 3.5rc2 was released yesterday and will be auto-deployed to all testbed instances. One issue with Postgresql was reported from the UC instance

Rob explains that the SLA is now circulated internally for review.

RFC proxies

  • SAM
    • a new version of the proxy renewal rpm is pending
    • EGI will move to RFC proxies as of Oct or Nov

Maarten explains that the old voms-proxy-init client (non-Java) generates legacy proxies by default. The newest version (Java) is intelligent enough to make the right choice based on the type of the input proxy and generates the matching proxy. This logic is not included in the C++ version, where a type needs to be chosen explicitly. The developers could be asked to include the same logic as in the Java client. In any case, voms-proxy-init needs to support legacy proxies until it is established that the RFC type breaks only stuff that is not officially supported.
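
The selection behaviour described above can be illustrated with a small sketch. This is not the voms-proxy-init implementation: the function choose_proxy_type, its parameters and the type labels "legacy" and "rfc" are hypothetical names used only to show the idea of inferring the output type from the input proxy and falling back to a client default. Letting an explicitly requested type take precedence is an additional assumption.

```python
from typing import Optional

# Hypothetical labels for the two proxy flavours discussed in the minutes.
VALID_TYPES = {"legacy", "rfc"}


def choose_proxy_type(input_proxy_type: Optional[str] = None,
                      requested_type: Optional[str] = None,
                      client_default: str = "legacy") -> str:
    """Pick the type of proxy to generate.

    Order of precedence (an assumption for this sketch):
    1. a type explicitly requested by the user,
    2. the type of the input proxy, if one is supplied,
    3. the client's default type.
    """
    for value in (input_proxy_type, requested_type, client_default):
        if value is not None and value not in VALID_TYPES:
            raise ValueError(f"unknown proxy type: {value!r}")
    if requested_type is not None:
        return requested_type
    if input_proxy_type is not None:
        return input_proxy_type
    return client_default


if __name__ == "__main__":
    # No hints at all: fall back to the default of the old client.
    print(choose_proxy_type())
    # An RFC input proxy leads to an RFC output proxy.
    print(choose_proxy_type(input_proxy_type="rfc"))
```

With client_default left at "legacy" the sketch mirrors the old client's default, while passing input_proxy_type mirrors the inference attributed to the Java client.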

Squid Monitoring and HTTP Proxy Discovery TFs

  • Nothing to report

Action list

| Creation date | Description | Responsible | Status | Comments |
| 2015-09-03 | Status of multi-core accounting | John Gordon | ONGOING | A presentation about the plans to provide multicore accounting data in the Accounting portal should be presented at the next Ops Coord meeting since this is a long standing issue |
| 2015-06-04 | Status of fix for Globus library (globus-gssapi-gsi-11.16-1) released in EPEL testing | Andrea Manzi | ONGOING | GGUS:114076 is now closed. However, host certificates need to be fixed for any service in WLCG that does not yet work OK with the new algorithm. Otherwise we will get hit early next year when the change finally comes in Globus 6.1. A broadcast message explaining the problem is ready and will be sent soon by EGI first. |

Specific actions for experiments

| Creation date | Description | Affected VO | Affected TF | Comments | Deadline | Completion |
| 2015-07-02 | Provide a description of each experiment computing support structure, so tickets wrongly assigned to the T0 (via SNOW or GGUS) can be properly redirected; evaluate the creation of SNOW Functional Elements for the experiments, if this is not already the case | all | n/a | A presentation about the support structure of each experiment will be done next meeting | July 30th. Extended to September 3rd | DONE |

Specific actions for sites

| Creation date | Description | Affected VO | Affected TF | Comments | Deadline | Completion |
| 2015-09-03 | T2s are requested to change analysis share from 50% to 25% since ATLAS runs centralised derivation production for analysis | ATLAS | - | - | Unknown | ONGOING |
| 2015-06-18 | Some sites have still not enabled multicore accounting | All | Multicore Deployment | Instructions here | a.s.a.p. | DONE |
| 2015-06-18 | CMS asks all T1 and T2 sites to provide Space Monitoring information to complete the picture on space usage at the sites. Please have a look at the general description and also some instructions for site admins. | CMS | - | | None yet | ~10 T2 sites missing, Ticket open |

AOB

-- MariaALANDESPRADILLO - 2015-09-02

Topic revision: r17 - 2018-02-28 - MaartenLitmaath