WLCG Operations Coordination Minutes, May 7, 2020

Highlights

Agenda

https://indico.cern.ch/event/915551/

Attendance

  • local:
  • remote: Alberto (monitoring), Alessandra (Napoli), Andreas P (KIT), Andreas W (CERN-IT-CDA), Borja (monitoring), Christoph (CMS), Concezio (LHCb), Dave M (FNAL), David B (IN2P3-CC), David C (Technion), Eric F (IN2P3), Eric G (CERN-IT-DB), Gavin (T0), Giuseppe (CMS), Johannes (ATLAS), Julia (WLCG), Luca (CERN-IT-ST), Maarten (ALICE + WLCG), Marian (monitoring + networks), Mark (LHCb), Matt (Lancaster), Pedro (monitoring), Pepe (PIC), Renato (LHCb), Ron (NLT1), Stephan (CMS), Tim (CERN-IT-CM), Tony (CERN-IT-CS), Vincent (security)
  • apologies:

Operations News

  • the next meeting is planned for June 4
    • please let us know if that date would be very inconvenient

Special topics

WLCG Critical Services. Review of definitions, impact and urgency.

  • Input from ATLAS: following the review of the Critical Services (https://twiki.cern.ch/twiki/bin/view/LCG/WLCGCritSvc), it might be good to also review the granularity of the impact and urgency definitions. There is no particular problem; these definitions were simply made several years ago and deserve a fresh look. Overall, the granularity looks too high and could be simplified where possible. For instance, for the urgency, after the experience gained in the past 10+ years, it is unclear whether we really want to distinguish between 1, 2 and 4 hours. It would be good if the service responsibles explained what can be expected from the service support, especially in the case of a GGUS Alarm ticket.
  • Twiki page with Critical services

Discussion

  • Julia: mind that the granularities are meant to indicate when the full impact is reached

  • Tony:
    • distinguish 24 from 72 hours to see if action may be needed during a weekend
    • experiments ought to have buffers to survive a weekend
    • tickets should in any case be opened as soon as a problem is noticed

  • Maarten:
    • a team ticket can be opened at any time
    • it can be upgraded to an alarm when needed, preferably at a reasonable time

  • Julia: do experiments follow the table or rather have their own instructions?

  • Alessandra: shifters have their own instructions

  • Stephan:
    • only a few people can open team or alarm tickets
    • the table is more for planning

  • Julia:
    • service providers probably do not look at the table either?
    • operations follow experience and best practices
    • the urgencies look too granular

  • Eric G: (Vidyo chat) might regular tests be done according to urgency levels?

  • Stephan: for granularities we also need to take different time zones into account

  • Tony: not all granularities have to be used

  • Maarten: we must not suggest we can distinguish them with such precision

  • Dave M: is there more meaning to those levels?

  • Julia: mind they do not promise how fast problems are solved

  • Mark: is there a consequence if the foreseen time is missed?

  • Julia: no

  • Christoph: in the worst case we would ask for a Service Incident Report

  • Mark: if a level changed, what would be the effect on a service provider?

  • Maarten:
    • we need the table to be realistic
    • discover mismatches between expectations and what is feasible

  • Stephan: agreed, the table should mainly be used for planning

  • Julia:
    • we plan to do some analysis of ticket timelines and compare with the table
    • to be discussed further in experiments and by service providers
    • granularities or anything else
    • we will come back to this in June

Migration of SAM to MONIT infrastructure

see the presentation

Discussion

  • Pepe: sites need a few years of A/R history for funding agency reports

  • Borja: how was this done previously?

  • Julia:
    • the history was essentially kept forever
    • it was very convenient for the WLCG audit we had in 2019
      • to look into the followup of incidents that affected the A/R in the last 2-3 years

  • Borja:
    • all data can be archived in HDFS
    • for special cases we can have special workflows

  • Julia: agreed, but Pepe's use case is more generic

  • Pepe: we do not need detailed granularity all the way, but at least monthly A/R

  • Maarten: we need at least 1 year with the highest granularity

  • Julia:
    • 1 year is OK
    • special workflows can make use of HDFS instead
    • to be checked with the experiments

  • Borja: we will present this also in IT-experiment meetings

  • Stephan: daily summaries should be available forever

  • Borja: for such aggregate results we can have much longer retention periods

  • Pedro: we could have intermediate granularity for 1 year

  • Maarten:
    • we know we cannot keep everything forever, as it would be too expensive
    • we need to find the middle ground for various use cases

  • Julia:
    • we need to be able to navigate to test results to look into A/R drops
    • 1 year would be sufficient for that use case

  • Borja:
    • are the HTML A/R reports really needed?
    • their images take up a lot of inodes and disk space

  • Maarten: let's drop them unless someone comes up with a strong use case

  • Borja: sites without data are ignored for federation A/R - that looks wrong?

  • Maarten: can we have a flag to decide which sites are in or out?

  • Pedro: OK

  • Borja: the VO feed can be used to decide what is production or not

  • Julia:
    • experiments also need to test sites that are not in production
    • a flag may be needed

  • Borja: non-production services can be tested in a different profile

  • Borja: the treatment of unknown status in SAM3 is problematic

  • Julia:
    • we cannot count an unknown status against a site, as it may be our fault
    • we had to favor the sites in the A/R calculations

  • Maarten:
    • various sites are in unknown status due to something being wrong on their end
    • they are lucky not to be critical instead
    • now would be a good time to reconsider such cases

  • Julia:
    • their A/R can in any case be recomputed if needed
    • though it can increase load on the Monit team

  • Stephan: what is the TTL of an unknown status?

  • Borja: the granularity is about 15 minutes
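
  As a purely illustrative aside, a minimal sketch of how such ~15-minute status samples could be rolled up into daily availability while leaving UNKNOWN out of the denominator (the status names, sample interval and A/R definition here are assumptions, not the production SAM3/MONIT algorithm):

    from collections import defaultdict
    from datetime import datetime

    # Assumed input: one (timestamp, site, status) sample per ~15 minutes;
    # the status vocabulary below is illustrative only.
    samples = [
        (datetime(2020, 5, 1, 0, 0),  "SITE-A", "OK"),
        (datetime(2020, 5, 1, 0, 15), "SITE-A", "UNKNOWN"),
        (datetime(2020, 5, 1, 0, 30), "SITE-A", "CRITICAL"),
        (datetime(2020, 5, 1, 0, 45), "SITE-A", "OK"),
    ]

    def daily_availability(samples):
        """Availability per (site, day) = OK samples / known samples.
        UNKNOWN samples are skipped entirely, so that monitoring problems
        do not count against the site."""
        ok, known = defaultdict(int), defaultdict(int)
        for ts, site, status in samples:
            key = (site, ts.date())
            if status == "UNKNOWN":
                continue          # excluded from the denominator
            known[key] += 1
            if status == "OK":
                ok[key] += 1
        return {k: ok[k] / known[k] for k in known}

    print(daily_availability(samples))   # {('SITE-A', datetime.date(2020, 5, 1)): 0.666...}

  Only such daily or monthly aggregates would need the long retention discussed above; the raw samples could then be limited to roughly one year.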

  • Maarten: how can an experiment easily launch a recomputation for all sites?

  • Pedro: we will look into adding a feature for that

  • Pedro: the use of GitLab allows audits of recomputation requests later

  • Julia:
    • we will look into a set of questions for feedback from concerned parties
    • a GDB presentation in July would be desirable

  • Borja: July is OK

  • Julia: (off-topic) can you check the REBUS API access logs for unexpected clients?

  • Borja, Pedro: yes, but we will only find frequent clients
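
  As an aside, a minimal sketch of how frequent API clients could be counted, with a hypothetical log path and assuming a standard combined access-log format (the actual REBUS server setup was not discussed):

    import re
    from collections import Counter

    LOG = "/var/log/httpd/rebus_access.log"   # hypothetical path
    # Combined log format: host ident user [time] "request" status bytes "referer" "user-agent"
    line_re = re.compile(
        r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|POST|HEAD) (\S+)[^"]*" \d+ \S+ "[^"]*" "([^"]*)"')

    clients = Counter()
    with open(LOG) as f:
        for line in f:
            m = line_re.match(line)
            if m:
                ip, path, agent = m.groups()
                if "/api" in path:            # assumed marker for API endpoints
                    clients[(ip, agent)] += 1

    # As noted above, only frequent clients will stand out.
    for (ip, agent), n in clients.most_common(20):
        print(n, ip, agent)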

Middleware News

Tier 0 News

Tier 1 Feedback

Tier 2 Feedback

Experiments Reports

ALICE

  • Mostly business as usual, thanks to site and CA admins!
  • No major issues.
  • Running up to ~6k concurrent Folding@Home jobs since April 6.

ATLAS

  • Smooth and stable production with 400-450k concurrently running grid job slots and the usual mix of MC generation, simulation, reconstruction, derivation production and user analysis.
    This includes about 95k slots from the CERN-P1 HLT farm and about 15k slots from Boinc. In addition, there are occasional bursts of ~100k jobs from NERSC/Cori.
  • COVID-19 jobs are running stably, using 60k job slots in total (13-15%). This comprises 30k from P1 (about 1/3 of that resource) and 30k from about 55 sites that opted in at the level of 10% of their pledge.
  • The RAW/DRAW reprocessing campaign using the data/tape carousel has now concluded. A full post-mortem with various experts is planned for May 14.
  • No other major issues apart from the usual storage- or transfer-related problems at sites.
  • Critical services feedback also supplied today
  • Grand unification of PanDA queues continues on a per-cloud basis - 3/4 done.
  • Related to the queue unification: FZK, RAL and probably soon other large sites have to be filled via dedicated MCORE queues to use the slots efficiently - is the HTCondor setup configuration shared among big sites? (see the configuration sketch below)
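
  A hedged illustration of one common pattern for mixing single- and multi-core jobs in an HTCondor pool (partitionable slots plus the condor_defrag daemon); the knob values are placeholders for illustration only, not a recipe from the meeting or from any of the sites mentioned:

    # Worker nodes: one partitionable slot per machine, so single-core and
    # multi-core jobs can be packed into the same resources.
    NUM_SLOTS = 1
    NUM_SLOTS_TYPE_1 = 1
    SLOT_TYPE_1 = cpus=100%, memory=100%, disk=100%
    SLOT_TYPE_1_PARTITIONABLE = True

    # Central manager: let condor_defrag periodically drain a few machines so
    # that whole 8-core blocks become available for the MCORE queues.
    DAEMON_LIST = $(DAEMON_LIST) DEFRAG
    DEFRAG_INTERVAL = 600
    DEFRAG_WHOLE_MACHINE_EXPR = Cpus >= 8
    DEFRAG_DRAINING_MACHINES_PER_HOUR = 1.0   # placeholder, tune per site
    DEFRAG_MAX_CONCURRENT_DRAINING = 10       # placeholder
    DEFRAG_MAX_WHOLE_MACHINES = 20            # placeholder

  Tuning the draining rate is exactly the trade-off mentioned in the discussion below: draining too aggressively wastes single-core capacity, draining too little starves the multi-core queues.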

Discussion

  • Johannes: we would like to highlight the HTCondor single- vs. multi-core issue

  • Andreas P:
    • ATLAS are asking us to optimize 2 opposite things
    • there is no magic configuration that can just be applied
    • we are in contact with DESY about this

  • Maarten:
    • there are forums where such matters can be discussed between sites
      • HEPiX
      • wlcg-htcondor list
      • wlcg-operations list
      • wlcg-ops-coord list
      • LCG-Rollout
      • ...

  • Julia: we can set up a Twiki page for site recipes

  • Maarten:
    • we will do that if it turns out to be desirable
    • let's first see how things go at the given sites

CMS

  • no Covid-19 related interruptions of the CMS computing infrastructure so far
    • significantly reduced computing capacity due to HLT running Folding@Home and sites contributing to national Covid-19 research or CMS F@H effort
  • jumbo frame issue at CERN impacting several sites, INC:2355684
    • still unresolved
    • after network maintenance, March 11th, OTG:0055311
  • running steadily at about 230k cores during the last month
    • usual analysis share of about 60k cores
    • Run 2 Monte Carlo production is largest activity

Discussion

  • Maarten: the jumbo frame ticket is waiting for a reply from the site admin

  • Stephan: we now have involved the admin of another affected site

LHCb

  • Fairly smooth operations, with little impact seen due to the current worldwide situation
  • Some sites understandably slower to respond/deal with issues but nothing significant
  • Currently running ~15K Folding@Home jobs on the HLT Farm
  • Current jobs consist of usual mix of MC production and user jobs.
  • Have ticketed Tier 2 sites to ask them about switching to CentOS 7. Most have this planned but need to wait until regular access returns.

Task Forces and Working Groups

GDPR and WLCG services

  • Updated list of services
  • Started to work on enabling the WLCG privacy notice for the central and experiment-specific services
  • Many services hosted by CERN have already drafted a CERN RoPO
  • Though the content of the CERN RoPO is very much the same as that of the WLCG Privacy Notice, the scope and approval workflow are different
  • Need to better understand how to go about the approval; it will be brought to the WLCG MB this month.

Accounting TF

  • The March accounting reports generated by CRIC were sent around for both T1 and T2. The April reports (generated in May) are planned to be the last ones produced by the EGI portal. Starting from the May reports (generated in June), the CRIC reports will become official
  • Changes in the accounting reports generated by CRIC vs EGI reports
    • Instead of the T1 storage accounting data (disk and tape) that used to be manually injected into REBUS, WSSA data are used
    • Disk storage accounting is also available for T2 sites
    • The long-standing issue with DESY for T2 reports has been fixed
    • All accounting data generated by APEL or WSSA is being validated by sites. Validated data is used for the reports

Archival Storage WG

Containers WG

CREAM migration TF

Details here

Summary:

  • 90 tickets
  • 7 done: 3 ARC, 4 HTCondor
  • 18 sites plan for ARC, 14 are considering it
  • 18 sites plan for HTCondor, 15 are considering it, 8 consider using SIMPLE
  • 15 tickets on hold, to be continued in a number of months
  • 9 tickets without reply
    • response times possibly affected by COVID-19 measures

dCache upgrade TF

  • Not much progress during last month

DPM upgrade TF

  • 38 sites upgraded and reconfigured with DOME

http://wlcg-cric.cern.ch/core/service/list/?type=se&show_5=0&show_6=1&state=ACTIVE&impl=dpm&version=DOME&show_11=0&show_18=0

  • 1 to upgrade and re-configure, in progress
  • 5 upgraded, still need to be reconfigured
  • 1 site is suspended for operations
  • 9 moving away from DPM

Information System Evolution TF

  • Migration of REBUS to CRIC is progressing according to the schedule. REBUS has been in read-only mode since the beginning of April. The plan is to retire REBUS at the beginning of June

IPv6 Validation and Deployment TF

Detailed status here.

Machine/Job Features TF

Monitoring

MW Readiness WG

Network Throughput WG


  • perfSONAR infrastructure status: version 4.2.4 was released - please upgrade
  • Update on the WG activities will be presented next week at the virtual LHCOPN/LHCONE workshop (https://indico.cern.ch/event/888924/)
  • OSG/WLCG infrastructure
    • New dashboards are now available providing high-level overview of packet loss, throughput, latency and traceroutes (https://atlas-kibana.mwt2.org/s/networking/goto/20dd25907d61df98a0b85b1dfaed54e1)
    • Working on a new LHCONE mesh that will focus on testing from sites to R&E endpoints
    • Meeting with perfSONAR developers this week on publishing measurements to message bus directly from perfSONAR toolkit - discussed different options and possible strategy going forward
    • ESnet (router) traffic feed now available, working on its integration to our pipeline - prototype already working
    • Also started working on integration of the OSG HTCondor jobs statistics (network related) - will be added to our pipeline and stream
  • WLCG Network Throughput Support Unit: see twiki for summary of recent activities.

Traceability WG

Action list

Creation date Description Responsible Status Comments

Specific actions for experiments

Creation date Description Affected VO Affected TF/WG Deadline Completion Comments

Specific actions for sites

Creation date Description Affected VO Affected TF/WG Deadline Completion Comments

AOB

  • Renato: LHCb thanks the MONIT team for the great support !

  • Stephan: CMS also thanks the team!