WLCG Operations Coordination Minutes, March 1st, 2018

Highlights

Agenda

Attendance

  • local: Dimitrios (WLCG), Ivan (ATLAS), Julia (WLCG), Maarten (ALICE + WLCG), Marian (networks + monitoring), Mayank (WLCG), Panos (WLCG), Shawn (AGLT2 + networks), Vladimir (tape storage)
  • remote: Catherine (LPSC + IN2P3), Christoph (CMS), Daniele (CNAF), David M (FNAL), Di (TRIUMF), Eric F (IN2P3-CC), Felix (ASGC), Frederique (LAPP), Gareth (RAL), Giuseppe (CMS), Igor (NRC-KI), Jeremy (GridPP), Kyle (OSG), Marcelo (CNAF), Peter (Oxford)
  • apologies:

Operations News

  • The next meeting is planned to be on April 12th
    • Please let us know if that date would pose a significant problem

Middleware News

  • MW Officer change
    • After fulfilling the role of MW Officer for more than 3.5 years, Andrea Manzi is moving on to new responsibilities.
    • We thank Andrea for his work in that role and would like to appoint a new MW Officer as soon as possible.
    • The MW Officer tasks are outlined here.
      • They mainly concern MW used at WLCG sites in EGI.
      • They ought not take more than 20% of an experienced person's time.
      • The MW Officer does not have to be an expert in all the MW.
      • Significant experience in WLCG operations is very desirable.
    • Please let us know about potential candidates, thanks!

  • Useful Links
  • Baselines/News
  • Issues
  • T0 and T1 services
    • From now on we will only collect such information on special occasions
      • E.g. to track an important upgrade campaign

Discussion

Tier 0 News

Tier 1 Feedback

Tier 2 Feedback

Experiments Reports

ALICE

  • The activity levels have been normal on average
    • Lowish recently - new productions are in preparation
  • Central services
    • Jobs were affected by fallout from a few instabilities - fixed
  • CNAF recovery
    • Disk SE back in production since Feb 23.
    • So far the data files that should be present look fine.
    • A big cleanup should get the SE contents back in sync with the catalog.
    • Computing resources are also looking good.
    • We thank and congratulate CNAF on the success of their careful efforts!

ATLAS

  • Stable grid production over the last weeks, with up to ~300-350k concurrently running job slots including the HLT farm, and peaks of ~400k or more when HPCs are included.
  • The large MC digitisation and reconstruction campaign finished at the beginning of February. The 2nd half of the data17 reprocessing from RAW has been ongoing since then, causing high staging activity from tape and heavy data transfers across the whole grid.
  • Several operational problems with the RAL FTS server in the past weeks have forced us to stop using that server for data transfers.
  • CNAF/INFN-T1 update:
    • For about 1.5 weeks there have been successful high access rates to INFN-T1_DATATAPE
    • Since the beginning of this week there have been successful high access rates to INFN-T1_DATADISK
    • Both areas are used in MC production again
    • HammerCloud production and analysis functional tests at the INFN-T1 farm have been successful since this Wednesday and the queues will be used in production again.
    • Thanks a lot to the CNAF/INFN-T1 team for the recovery!
  • Rucio (data management) community workshop today and tomorrow.
  • ATLAS sites jamboree will happen next week, March 5-7th.

CMS

  • CMS getting ready for data taking
    • Global cosmic RUN on March 5th

  • Resource utilization: rather high load in the CMS Global Pool
    • typically over 200k cores for production/analysis

  • CVMFS issue due to the 8 TB quota limit being reached
    • CVMFS deployment became slow (INC:1605471)
    • Continuing to increase the quota cannot be a long term solution

  • Tier-0
    • resources in the OpenStack share are migrating to the CERN HTCondor pool
    • issue with Tier-0 redirectors, now solved (GGUS:133010)

  • CASTORCMS: several problems have been sorted out

  • CNAF: Almost back in production
    • Job submission enabled
    • Gradually enabling PhEDEx agents due to the huge backlog
      • 50% of disk deletions (still 600TB to go), then restart other agents

  • Singularity deployment is proceeding well
    • 60% of wall clock time last week was used via Singularity
    • currently ~100 CEs with Singularity
    • 20 CEs installing it
    • ~70 CEs without Singularity.
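
The CE counts above can be turned into rough shares. This is only a small sketch: the counts are the approximate figures quoted in the report, and the variable names are chosen here for illustration.

```python
# Approximate CE counts quoted in the CMS report above.
ce_counts = {"with Singularity": 100, "installing": 20, "without Singularity": 70}

total = sum(ce_counts.values())

# Percentage of CEs in each state, rounded to one decimal place.
shares = {state: round(100 * n / total, 1) for state, n in ce_counts.items()}

for state, pct in shares.items():
    print(f"{state}: {pct}% of {total} CEs")
```

Note that the share of CEs is not directly comparable with the 60% wall clock figure, since CEs differ in size.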

Discussion

  • Christoph provided more details about the CVMFS issue:
    • The main repository was set up in the old way that only allows additions
    • When many old files were replaced over time, the actual disk usage grew
    • The setup needs to be changed to allow garbage collection
    • That will require 1 or 2 days of downtime, so needs to be scheduled carefully

  • Julia: will the SAM Singularity test become critical by the end of March?
  • Giuseppe: yes; possibly earlier if the majority of sites are OK
  • Catherine:
    • All French T2 sites are OK, but the T1 will be late
    • The T1 WN will be upgraded to CentOS7 + Singularity on March 13
  • Giuseppe: that timeline is OK for CMS
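
The CVMFS issue Christoph described can be illustrated with a toy model of a content-addressed store. This is a sketch under stated assumptions only: the `ToyStore` class and its methods are hypothetical and do not correspond to any CVMFS interface. It merely shows why an additions-only store keeps growing when files are replaced, and how garbage collection reclaims objects that the current catalog no longer references.

```python
import hashlib


class ToyStore:
    """Hypothetical content-addressed store; not a CVMFS interface."""

    def __init__(self):
        self.objects = {}   # content hash -> stored bytes
        self.catalog = {}   # path -> content hash of the current version

    def publish(self, path, data):
        """Add or replace a file; superseded objects are never deleted here."""
        digest = hashlib.sha1(data).hexdigest()
        self.objects[digest] = data
        self.catalog[path] = digest

    def disk_usage(self):
        """Bytes held by all stored objects, referenced or not."""
        return sum(len(blob) for blob in self.objects.values())

    def garbage_collect(self):
        """Drop objects that no catalog entry references; return bytes freed."""
        live = set(self.catalog.values())
        dead = [h for h in self.objects if h not in live]
        freed = sum(len(self.objects[h]) for h in dead)
        for h in dead:
            del self.objects[h]
        return freed


store = ToyStore()
store.publish("/sw/lib.so", b"A" * 100)   # first version
store.publish("/sw/lib.so", b"B" * 100)   # replacement; the old object stays
usage_before = store.disk_usage()          # both versions still occupy space
freed = store.garbage_collect()            # only the current version survives
usage_after = store.disk_usage()
```

The real system is more involved than this model, which is consistent with the change needing a carefully scheduled downtime.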

LHCb

a posteriori:

  • CNAF:
    • Started running Simulation jobs after the LHCb queue was enabled
    • Waiting to be able to stage data to finalize productions
  • Production:
    • Finished Stripping28r1
    • Started Stripping29r2
  • CERN/T0 problem with updating DBOD
    • The intervention was to upgrade the DB services from RH6.7 to CC7.4
    • High load problems were experienced, related to a CC7/NFS bug
    • The Linux team upgraded to the latest available kernel and no problems have been seen since

Ongoing Task Forces and Working Groups

Accounting TF

  • The WLCG Accounting Task Force meeting at the end of January reviewed the situation with HTCondor accounting. More effort in this area is required. APEL plans were presented.
  • The next WLCG Accounting Task Force meeting will take place next Thursday. Experiments are kindly asked to provide input regarding the preparation of RRB reports: what they use, and whether improvements to the central accounting systems could help them with those reports.
  • ATLAS reported some issues with CERN accounting data in the EGI portal. Under investigation.
  • The progress with Storage Space accounting was presented at today's meeting

Archival Storage WG

See the special topic on tape metrics

Information System Evolution TF

  • NTR

IPv6 Validation and Deployment TF

Detailed status here.

  • Kyle described the current state of affairs in OSG
    • The transition is not centrally monitored
    • T2 admins were asked to describe the situation at their sites
    • For CMS there are 4 sites OK, 1 in progress and 2 stuck in some way
    • For ATLAS the sites are in various states, often coupled to their campus plans
  • Shawn:
    • For perfSONAR the situation generally is OK
    • At AGLT2 the plan is to make dCache dual-stack next week
  • Maarten: we will ask Andrea to add the OSG sites to the tracking table
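
As a rough aid for tracking dual-stack status, a host name can be checked for both IPv4 and IPv6 address records. The sketch below uses Python's standard `socket.getaddrinfo`; the helper names, the `example.org` host names and the stand-in resolver are assumptions made here so that the example runs offline.

```python
import socket


def address_families(hostname, resolver=socket.getaddrinfo):
    """Return the set of IP families ('IPv4', 'IPv6') a host name resolves to."""
    names = {socket.AF_INET: "IPv4", socket.AF_INET6: "IPv6"}
    try:
        infos = resolver(hostname, None)
    except socket.gaierror:
        return set()
    # Each result is a tuple whose first element is the address family.
    return {names[info[0]] for info in infos if info[0] in names}


def is_dual_stack(hostname, resolver=socket.getaddrinfo):
    """True if the host name resolves to both IPv4 and IPv6 addresses."""
    return address_families(hostname, resolver) == {"IPv4", "IPv6"}


def fake_resolver(hostname, port):
    """Stand-in resolver with made-up records (documentation address ranges)."""
    records = {
        "se.example.org": [
            (socket.AF_INET, socket.SOCK_STREAM, 6, "", ("192.0.2.10", 0)),
            (socket.AF_INET6, socket.SOCK_STREAM, 6, "", ("2001:db8::10", 0, 0, 0)),
        ],
        "old-se.example.org": [
            (socket.AF_INET, socket.SOCK_STREAM, 6, "", ("192.0.2.11", 0)),
        ],
    }
    if hostname not in records:
        raise socket.gaierror("unknown host")
    return records[hostname]


dual = is_dual_stack("se.example.org", fake_resolver)
v4_only = is_dual_stack("old-se.example.org", fake_resolver)
```

Calling `is_dual_stack(name)` without the second argument uses the real resolver. Such a check only shows that both kinds of address record exist; it does not prove that the service actually accepts connections over both protocols.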

Machine/Job Features TF

Monitoring

MW Readiness WG

Network Throughput WG


  • perfSONAR 4.0.2 and CC7 campaign - 190 instances updated to 4.0.2; 64 instances already on CC7
    • WLCG broadcast will be sent to remind sites to plan an upgrade to CC7 and review the firewall port openings
    • The perfSONAR 4.1 release, planned for Q2 2018, will no longer ship SL6 packages
  • WLCG/OSG network services
  • WLCG Network Throughput Support Unit: see twiki for summary of recent activities.
  • LHCOPN/LHCONE Workshop will take place next week - update on WG activities will be presented (https://indico.cern.ch/event/681168/)
  • perfSONAR developers F2F meeting will take place next week in Amsterdam - feedback from OSG/WLCG will be presented

Squid Monitoring and HTTP Proxy Discovery TFs

  • NTR

Traceability WG

Container WG

Special topics

CNAF recovery update

  • Daniele Cesini described the current situation and plans recorded on this week's ops page
    • A new, independent problem with storage HW is affecting various services
      • In particular the LSF shared file system and the CMS PhEDEx instance
      • The LSF batch system and the PhEDEx service are thus down
      • External experts need to come in to get the problem fixed
      • A downtime has been declared until Monday afternoon
    • ALICE and ATLAS services are in production
    • CMS is fine except for PhEDEx
      • Singularity will only be deployed on CentOS7 WN, currently only at CINECA
      • The WN at CNAF will be upgraded to CentOS7 in the near future
    • For LHCb the situation is as follows:
      • The old SE is available like it was before the flooding
      • It should be decommissioned this month, though
      • It may thus be better to copy the data to the new SE first
      • We will agree a plan with LHCb

SAM recalculation policy draft

presentation

  • Julia presented the proposal described in the slides
  • There was a short discussion during which Maarten summarized the essence:
    • A/R reports are a useful tool
    • As the SAM machinery is complex, there will be occasional glitches
    • We should tolerate such glitches if they are below reasonable thresholds
    • The proposed thresholds are given in the presentation
  • There were no objections in the meeting
  • Please let us know if there would be a serious issue with the proposal
  • We plan to ratify the final proposal in the next meeting

Tape metrics in WLCG Storage Space Accounting prototype

presentation

  • Dimitrios presented his slides on tape metrics in the WLCG Storage Space Accounting prototype
  • Julia further explained the ideas behind the reports:
    • Currently only the T1 sites inject storage accounting data into REBUS
    • We want such data collection to be automatic and also include the T2 sites
    • Brian Bockelman pointed out the need for validation by site admins
    • Such functionality will require appropriate authN and authZ mechanisms
    • CRIC already has those in place and can easily be extended for this purpose
    • Ultimately the summaries will go into accounting and RRB reports

  • Next Vladimir spoke about the request for T1 sites to send their tape metrics data
    • Currently only BNL, KIT, NRC-KI and PIC are sending such data
    • David M (FNAL): we need to look into sending only the data concerning CMS
      • Vladimir: you may be able to start from the Enstore scripts used at PIC
    • Eric F (IN2P3-CC): we need to make it work for HPSS
      • Vladimir: you may be able to start from the HPSS scripts used at BNL
    • Gareth (RAL): will follow up
    • Di (TRIUMF): ditto
    • Daniele (CNAF): ditto
    • Julia: at the next WLCG Operations Coordination meeting we should review the progress.
      • Julia will send a reminder to all sites that have not enabled reporting by the next meeting.

  • Vladimir then described the tape survey for T1 sites to fill in
    • It is about how to use tape more efficiently
    • To be presented and discussed at HEPiX, May 14-18
    • We already see big variances between ATLAS and CMS
    • We need the data from all sites to help us arrive at a common strategy

Action list

| Creation date | Description | Responsible | Status | Comments |
| 01 Sep 2016 | Collect plans from sites to move to EL7 | WLCG Operations | Ongoing | [ older comments suppressed ] Dec 7 update: Tier-1 plans are documented in the Nov 2 minutes. Jan 18 update: CREAM and the UI were released in UMD-4 on Dec 18. |
| 03 Nov 2016 | Review VO ID Card documentation and make sure it is suitable for multicore | WLCG Operations | In progress | Jan 26 update: needs to be done in collaboration with EGI. March 1st update: Maarten will look into this. |
| 14 Sep 2017 | Followup of CVMFS configuration changes, check effects on sites in Asia | WLCG Operations | Pending | March 1st update: this might imply significant effort; low priority for now. |

Specific actions for experiments

| Creation date | Description | Affected VO | Affected TF/WG | Comments | Deadline | Completion |

Specific actions for sites

| Creation date | Description | Affected VO | Affected TF/WG | Comments | Deadline | Completion |

AOB

  • Julia: the early registration deadline for the WLCG & HSF Workshop has been extended up to and including March 5th
Topic revision: r12 - 2018-03-06 - MaartenLitmaath
 