WLCG Operations Coordination Minutes, May 7, 2020

Highlights

Agenda

https://indico.cern.ch/event/915551/

Attendance

  • local:
  • remote: Alberto (monitoring), Alessandra (Napoli), Andreas P (KIT), Andreas W (CERN-IT-CDA), Borja (monitoring), Christoph (CMS), Concezio (LHCb), Dave M (FNAL), David B (IN2P3-CC), David C (Technion), Eric F (IN2P3), Eric G (CERN-IT-DB), Gavin (T0), Giuseppe (CMS), Johannes (ATLAS), Julia (WLCG), Luca (CERN-IT-ST), Maarten (ALICE + WLCG), Marian (monitoring + networks), Mark (LHCb), Matt (Lancaster), Pedro (monitoring), Pepe (PIC), Renato (LHCb), Ron (NLT1), Stephan (CMS), Tim (CERN-IT-CM), Tony (CERN-IT-CS), Vincent (security)
  • apologies:

Operations News

  • the next meeting is planned for June 4
    • please let us know if that date would be very inconvenient

Special topics

WLCG Critical Services. Review of definitions, impact and urgency.

  • Input from ATLAS: following the review of the Critical Services (https://twiki.cern.ch/twiki/bin/view/LCG/WLCGCritSvc), it might be good to also review the granularity of the impact and urgency definitions. There is no particular problem; these definitions were simply made several years ago and deserve a fresh look. Overall, the granularity looks too high and could be simplified where possible. For instance, for the urgency, after the experience gained in the past 10+ years, it is unclear whether we really want to distinguish between 1, 2 and 4 hours. It would be good if the service responsibles explained what can be expected from the service support, especially in the case of a GGUS Alarm ticket.
  • Twiki page with Critical services

Discussion

  • Julia: mind that the granularities are meant to indicate when the full impact is reached

  • Tony:
    • distinguish 24 from 72 hours to see if action may be needed during a weekend
    • experiments ought to have buffers to survive a weekend
    • tickets should in any case be opened as soon as a problem is noticed

  • Maarten:
    • a team ticket can be opened at any time
    • it can be upgraded to an alarm when needed, preferably at a reasonable time

  • Julia: do experiments follow the table or rather have their own instructions?

  • Alessandra: shifters have their own instructions

  • Stephan:
    • only a few people can open team or alarm tickets
    • the table is more for planning

  • Julia:
    • service providers probably do not look at the table either?
    • operations follow experience and best practices
    • the urgencies look too granular

  • Eric G: (Vidyo chat) might regular tests be done according to urgency levels?

  • Stephan: for granularities we also need to take different time zones into account

  • Tony: not all granularities have to be used

  • Maarten: we must not suggest we can distinguish them with such precision

  • Dave M: is there more meaning to those levels?

  • Julia: mind they do not promise how fast problems are solved

  • Mark: is there a consequence if the foreseen time is missed?

  • Julia: no

  • Christoph: in the worst case we would ask for a Service Incident Report

  • Mark: if a level changed, what would be the effect on a service provider?

  • Maarten:
    • we need the table to be realistic
    • discover mismatches between expectations and what is feasible

  • Stephan: agreed, the table should mainly be used for planning

  • Julia:
    • we plan to do some analysis of ticket timelines and compare with the table
    • to be discussed further in experiments and by service providers
    • granularities or anything else
    • we will come back to this in June

Migration of SAM to MONIT infrastructure

see the presentation

Discussion

  • Pepe: sites need a few years of A/R history for funding agency reports

  • Borja: how was this done previously?

  • Julia:
    • the history was essentially kept forever
    • it was very convenient for the WLCG audit we had in 2019
      • to look into the followup of incidents that affected the A/R in the last 2-3 years

  • Borja:
    • all data can be archived in HDFS
    • for special cases we can have special workflows

  • Julia: agreed, but Pepe's use case is more generic

  • Pepe: we do not need detailed granularity all the way, but at least monthly A/R

  • Maarten: we need at least 1 year with the highest granularity

  • Julia:
    • 1 year is OK
    • special workflows can make use of HDFS instead
    • to be checked with the experiments

  • Borja: we will present this also in IT-experiment meetings

  • Stephan: daily summaries should be available forever

  • Borja: for such aggregate results we can have much longer retention periods

  • Pedro: we could have intermediate granularity for 1 year

  • Maarten:
    • we know we cannot keep everything forever, as it would be too expensive
    • we need to find the middle ground for various use cases

  • Julia:
    • we need to be able to navigate to test results to look into A/R drops
    • 1 year would be sufficient for that use case

  • Borja:
    • are the HTML A/R reports really needed?
    • their images take up a lot of inodes and disk space

  • Maarten: let's drop them unless someone comes up with a strong use case

  • Borja: sites without data are ignored for federation A/R - that looks wrong?

  • Maarten: can we have a flag to decide which sites are in or out?

  • Pedro: OK

  • Borja: the VO feed can be used to decide what is production or not

  • Julia:
    • experiments also need to test sites that are not in production
    • a flag may be needed

  • Borja: non-production services can be tested in a different profile

  • Borja: the treatment of unknown status in SAM3 is problematic

  • Julia:
    • we cannot count an unknown status against a site, as it may be our fault
    • we had to favor the sites in the A/R calculations

  • Maarten:
    • various sites are in unknown status due to something being wrong on their end
    • they are lucky not to be critical instead
    • now would be a good time to reconsider such cases

  • Julia:
    • their A/R can in any case be recomputed if needed
    • though it can increase load on the Monit team

  • Stephan: what is the TTL of an unknown status?

  • Borja: the granularity is about 15 minutes
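
  As a purely illustrative aside, a minimal sketch of how such ~15-minute status samples could be rolled up into daily availability while leaving UNKNOWN out of the denominator (the status names, sample interval and A/R definition here are assumptions, not the production SAM3/MONIT algorithm):

    from collections import defaultdict
    from datetime import datetime

    # Assumed input: one (timestamp, site, status) sample per ~15 minutes;
    # the status vocabulary below is illustrative only.
    samples = [
        (datetime(2020, 5, 1, 0, 0),  "SITE-A", "OK"),
        (datetime(2020, 5, 1, 0, 15), "SITE-A", "UNKNOWN"),
        (datetime(2020, 5, 1, 0, 30), "SITE-A", "CRITICAL"),
        (datetime(2020, 5, 1, 0, 45), "SITE-A", "OK"),
    ]

    def daily_availability(samples):
        """Availability per (site, day) = OK samples / known samples.
        UNKNOWN samples are skipped entirely, so that monitoring problems
        do not count against the site."""
        ok, known = defaultdict(int), defaultdict(int)
        for ts, site, status in samples:
            key = (site, ts.date())
            if status == "UNKNOWN":
                continue          # excluded from the denominator
            known[key] += 1
            if status == "OK":
                ok[key] += 1
        return {k: ok[k] / known[k] for k in known}

    print(daily_availability(samples))   # {('SITE-A', datetime.date(2020, 5, 1)): 0.666...}

  Only such daily or monthly aggregates would need the long retention discussed above; the raw samples could then be limited to roughly one year.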

  • Maarten: how can an experiment easily launch a recomputation for all sites?

  • Pedro: we will look into adding a feature for that

  • Pedro: the use of GitLab allows audits of recomputation requests later

  • Julia:
    • we will look into a set of questions for feedback from concerned parties
    • a GDB presentation in July would be desirable

  • Borja: July is OK

  • Julia: (off-topic) can you check the REBUS API access logs for unexpected clients?

  • Borja, Pedro: yes, but we will only find frequent clients
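
  As an aside, a minimal sketch of how frequent API clients could be counted, with a hypothetical log path and assuming a standard combined access-log format (the actual REBUS server setup was not discussed):

    import re
    from collections import Counter

    LOG = "/var/log/httpd/rebus_access.log"   # hypothetical path
    # Combined log format: host ident user [time] "request" status bytes "referer" "user-agent"
    line_re = re.compile(
        r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|POST|HEAD) (\S+)[^"]*" \d+ \S+ "[^"]*" "([^"]*)"')

    clients = Counter()
    with open(LOG) as f:
        for line in f:
            m = line_re.match(line)
            if m:
                ip, path, agent = m.groups()
                if "/api" in path:            # assumed marker for API endpoints
                    clients[(ip, agent)] += 1

    # As noted above, only frequent clients will stand out.
    for (ip, agent), n in clients.most_common(20):
        print(n, ip, agent)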

Middleware News

Tier 0 News

Tier 1 Feedback

Tier 2 Feedback

Experiments Reports

ALICE

  • Mostly business as usual, thanks to site and CA admins!
  • No major issues.
  • Running up to ~6k concurrent Folding@Home jobs since April 6.

ATLAS

  • Smooth and stable production with 400-450k concurrently running grid job slots and the usual mix of MC generation, simulation, reconstruction, derivation production and user analysis.
    This includes about 95k slots from the CERN-P1 HLT farm and about 15k slots from Boinc. In addition, there are occasional bursts of ~100k jobs from NERSC/Cori.
  • COVID-19 jobs are running stably, using 60k job slots in total (13-15%). This comprises 30k from P1 (about 1/3 of that resource) and 30k from about 55 sites that opted in at the level of 10% of their pledge.
  • The RAW/DRAW reprocessing campaign using the data/tape carousel has now concluded. A full post-mortem with various experts is planned for May 14.
  • No other major issues apart from the usual storage- or transfer-related problems at sites.
  • Critical services feedback also supplied today
  • Grand unification of PanDA queues continues on a per-cloud basis - 3/4 done.
  • Related to the queue unification: FZK, RAL and probably soon other large sites have to be filled via dedicated MCORE queues to use the slots efficiently - is the HTCondor setup configuration shared among big sites? (see the configuration sketch below)
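
  A hedged illustration of one common pattern for mixing single- and multi-core jobs in an HTCondor pool (partitionable slots plus the condor_defrag daemon); the knob values are placeholders for illustration only, not a recipe from the meeting or from any of the sites mentioned:

    # Worker nodes: one partitionable slot per machine, so single-core and
    # multi-core jobs can be packed into the same resources.
    NUM_SLOTS = 1
    NUM_SLOTS_TYPE_1 = 1
    SLOT_TYPE_1 = cpus=100%, memory=100%, disk=100%
    SLOT_TYPE_1_PARTITIONABLE = True

    # Central manager: let condor_defrag periodically drain a few machines so
    # that whole 8-core blocks become available for the MCORE queues.
    DAEMON_LIST = $(DAEMON_LIST) DEFRAG
    DEFRAG_INTERVAL = 600
    DEFRAG_WHOLE_MACHINE_EXPR = Cpus >= 8
    DEFRAG_DRAINING_MACHINES_PER_HOUR = 1.0   # placeholder, tune per site
    DEFRAG_MAX_CONCURRENT_DRAINING = 10       # placeholder
    DEFRAG_MAX_WHOLE_MACHINES = 20            # placeholder

  Tuning the draining rate is exactly the trade-off mentioned in the discussion below: draining too aggressively wastes single-core capacity, draining too little starves the multi-core queues.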

Discussion

  • Johannes: we would like to highlight the HTCondor single- vs. multi-core issue

  • Andreas P:
    • ATLAS are asking us to optimize 2 opposite things
    • there is no magic configuration that can just be applied
    • we are in contact with DESY about this

  • Maarten:
    • there are forums where such matters can be discussed between sites
      • HEPiX
      • wlcg-htcondor list
      • wlcg-operations list
      • wlcg-ops-coord list
      • LCG-Rollout
      • ...

  • Julia: we can set up a Twiki page for site recipes

  • Maarten:
    • we will do that if it turns out to be desirable
    • let's first see how things go at the given sites

CMS

  • no Covid-19 related interruptions of the CMS computing infrastructure so far
    • significantly reduced computing capacity due to HLT running Folding@Home and sites contributing to national Covid-19 research or CMS F@H effort
  • jumbo frame issue at CERN impacting several sites, INC:2355684
    • still unresolved
    • after network maintenance, March 11th, OTG:0055311
  • running steadily at about 230k cores during the last month
    • usual analysis share of about 60k cores
    • Run 2 Monte Carlo production is largest activity

Discussion

  • Maarten: the jumbo frame ticket is waiting for a reply from the site admin

  • Stephan: we now have involved the admin of another affected site

LHCb

  • Fairly smooth operations, with little impact seen due to the current worldwide situation
  • Some sites understandably slower to respond/deal with issues but nothing significant
  • Currently running ~15K Folding@Home jobs on the HLT Farm
  • Current jobs consist of usual mix of MC production and user jobs.
  • Have ticketed Tier 2 sites to ask them about switching to CentOS 7. Most have this planned but need to wait until regular access returns.

Task Forces and Working Groups

GDPR and WLCG services

  • Updated list of services
  • Started to work on enabling the WLCG privacy notice for the central and experiment-specific services
  • Many services hosted by CERN have already drafted a CERN RoPO
  • Though the content of the CERN RoPO is very much the same as that of the WLCG Privacy Notice, the scope and approval workflow are different
  • Need to better understand how to go about the approval; it will be brought to the WLCG MB this month.

Accounting TF

  • The March accounting reports generated by CRIC were sent around for both T1 and T2. The April reports (generated in May) are planned to be the last ones produced by the EGI portal. Starting from the May reports (generated in June), the CRIC reports will become official
  • Changes in the accounting reports generated by CRIC vs EGI reports
    • Instead of the T1 storage accounting data (disk and tape) that used to be manually injected into REBUS, WSSA data are used
    • Disk storage accounting is also available for T2 sites
    • The long-standing issue with DESY for T2 reports has been fixed
    • All accounting data generated by APEL or WSSA is being validated by sites. Validated data is used for the reports

Archival Storage WG

Containers WG

CREAM migration TF

Details here

Summary:

  • 90 tickets
  • 7 done: 3 ARC, 4 HTCondor
  • 18 sites plan for ARC, 14 are considering it
  • 18 sites plan for HTCondor, 15 are considering it, 8 consider using SIMPLE
  • 15 tickets on hold, to be continued in a number of months
  • 9 tickets without reply
    • response times possibly affected by COVID-19 measures

dCache upgrade TF

  • Not much progress during last month

DPM upgrade TF

  • 38 sites upgraded and reconfigured with DOME

http://wlcg-cric.cern.ch/core/service/list/?type=se&show_5=0&show_6=1&state=ACTIVE&impl=dpm&version=DOME&show_11=0&show_18=0

  • 1 to upgrade and re-configure, in progress
  • 5 upgraded, still need to be reconfigured
  • 1 site is suspended for operations
  • 9 moving away from DPM

Information System Evolution TF

  • Migration of REBUS to CRIC is progressing according to the schedule. REBUS has been in read-only mode since the beginning of April. The plan is to retire REBUS at the beginning of June

IPv6 Validation and Deployment TF

Detailed status here.

Machine/Job Features TF

Monitoring

MW Readiness WG

Network Throughput WG


  • perfSONAR infrastructure status: version 4.2.4 was released - please upgrade
  • Update on the WG activities will be presented next week at the virtual LHCOPN/LHCONE workshop (https://indico.cern.ch/event/888924/)
  • OSG/WLCG infrastructure
    • New dashboards are now available providing high-level overview of packet loss, throughput, latency and traceroutes (https://atlas-kibana.mwt2.org/s/networking/goto/20dd25907d61df98a0b85b1dfaed54e1)
    • Working on a new LHCONE mesh that will focus on testing from sites to R&E endpoints
    • Meeting with perfSONAR developers this week on publishing measurements to message bus directly from perfSONAR toolkit - discussed different options and possible strategy going forward
    • ESnet (router) traffic feed now available, working on its integration to our pipeline - prototype already working
    • Also started working on integration of the OSG HTCondor jobs statistics (network related) - will be added to our pipeline and stream
  • WLCG Network Throughput Support Unit: see twiki for summary of recent activities.

Traceability WG

Action list

Creation date Description Responsible Status Comments

Specific actions for experiments

Creation date Description Affected VO Affected TF/WG Deadline Completion Comments

Specific actions for sites

Creation date Description Affected VO Affected TF/WG Deadline Completion Comments

AOB

  • Renato: LHCb thanks the MONIT team for the great support !

  • Stephan: CMS also thanks the team!