remote: Alberto (monitoring), Alessandra (Napoli), Andreas P (KIT), Andreas W (CERN-IT-CDA), Borja (monitoring), Christoph (CMS), Concezio (LHCb), Dave M (FNAL), David B (IN2P3-CC), David C (Technion), Eric F (IN2P3), Eric G (CERN-IT-DB), Gavin (T0), Giuseppe (CMS), Johannes (ATLAS), Julia (WLCG), Luca (CERN-IT-ST), Maarten (ALICE + WLCG), Marian (monitoring + networks), Mark (LHCb), Matt (Lancaster), Pedro (monitoring), Pepe (PIC), Renato (LHCb), Ron (NLT1), Stephan (CMS), Tim (CERN-IT-CM), Tony (CERN-IT-CS), Vincent (security)
apologies:
Operations News
the next meeting is planned for June 4
please let us know if that date would be very inconvenient
Special topics
WLCG Critical Services. Review of definitions, impact and urgency.
Input from ATLAS: Following the review of the Critical Services https://twiki.cern.ch/twiki/bin/view/LCG/WLCGCritSvc, we thought it might be good to also review the granularity of the impact and urgency definitions. There is no particular problem; it is just that these definitions were drawn up several years ago and are worth revisiting. Overall, we think the granularity is too fine and could be simplified where possible. For instance, with the experience gained over the past 10+ years, it is unclear whether the urgency levels of 1, 2 and 4 hours still need to be distinguished. It would also be good if the people responsible for the services explained what can be expected from the service support, especially in the case of GGUS alarm tickets.
ALICE
Mostly business as usual, thanks to site and CA admins!
No major issues.
Running up to ~6k concurrent Folding@Home jobs since April 6.
ATLAS
Smooth and stable production with 400-450k concurrently running grid job slots and the usual mix of MC generation, simulation, reconstruction, derivation production and user analysis. This includes about 95k slots from the CERN-P1 HLT farm and about 15k slots from BOINC. In addition, there are occasional bursts of ~100k jobs from NERSC/Cori.
COVID-19 jobs are running stably, using 60k job slots in total (13-15% of resources). This comprises 30k from P1 (about 1/3 of that resource) and 30k from about 55 sites that opted in at the level of 10% of their pledge.
The RAW/DRAW reprocessing campaign using the data/tape carousel has now concluded. A full post-mortem will be held together with various experts on May 14.
No other major issues apart from the usual storage or transfer related problems at sites.
Critical services feedback also supplied today
Grand unification of PanDA queues continues on a per-cloud basis - 3/4 done.
Related to the queue unification: FZK and RAL, and probably soon other large sites, have to be filled through dedicated MCORE queues to use the slots efficiently - is there any sharing of HTCondor configuration among the big sites? (A possible configuration sketch follows the discussion below.)
Discussion
Johannes: we would like to highlight the HTCondor single- vs. multi-core issue
Andreas P:
ATLAS are asking us to optimize two opposing things at once
there is no magic configuration that can just be applied
we are in contact with DESY about this
Maarten:
there are forums where such matters can be discussed between sites
HEPiX
wlcg-htcondor list
wlcg-operations list
wlcg-ops-coord list
LCG-Rollout
...
Julia: we can set up a Twiki page for site recipes
Maarten:
we will do that if it turns out to be desirable
let's first see how things go at the given sites
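For reference, a minimal sketch of the kind of recipe such a Twiki page could collect, assuming a plain HTCondor pool; all values below are illustrative, not taken from any site. One partitionable slot per worker node serves both single-core and MCORE jobs, and the defrag daemon periodically drains fragmented nodes so that whole-node slots become available for multi-core jobs again:

    # Illustrative condor_config excerpt - example values, not a tuned site recipe
    # One partitionable slot per worker node; jobs carve out the cores/memory they request
    NUM_SLOTS = 1
    NUM_SLOTS_TYPE_1 = 1
    SLOT_TYPE_1 = cpus=100%,mem=100%,auto
    SLOT_TYPE_1_PARTITIONABLE = TRUE

    # Run the defrag daemon so that nodes fragmented by single-core jobs
    # are periodically drained back into slots large enough for MCORE jobs
    DAEMON_LIST = $(DAEMON_LIST) DEFRAG
    DEFRAG_INTERVAL = 600
    DEFRAG_DRAINING_MACHINES_PER_HOUR = 1.0
    DEFRAG_MAX_WHOLE_MACHINES = 20
    # A machine counts as "whole" once all of its cores are free again
    DEFRAG_WHOLE_MACHINE_EXPR = Cpus == TotalCpus

The opposing goals mentioned above show up directly in the defrag knobs: draining more machines per hour helps MCORE throughput but idles single-core capacity while nodes drain.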
CMS
no COVID-19 related interruptions to the CMS computing infrastructure so far
significantly reduced computing capacity due to the HLT running Folding@Home and sites contributing to national COVID-19 research or the CMS F@H effort
jumbo frame issue at CERN impacting several sites, INC:2355684
still unresolved
the issue appeared after the network maintenance of March 11 (OTG:0055311)
running steadily at about 230k cores over the last month
usual analysis share of about 60k cores
Run 2 Monte Carlo production is the largest activity
Discussion
Maarten: the jumbo frame ticket is waiting for a reply from the site admin
Stephan: we have now involved the admin of another affected site
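For reference, a common way to check whether a given path passes jumbo frames (a generic recipe, not taken from the ticket; the host below is a placeholder) is to send non-fragmentable ICMP packets sized for a 9000-byte MTU:

    # 8972 bytes of payload + 20-byte IP header + 8-byte ICMP header = 9000 bytes
    ping -M do -s 8972 remote-host.example.org
    # for comparison, a probe that fits a standard 1500-byte MTU
    ping -M do -s 1472 remote-host.example.org

If the jumbo probe fails (e.g. with "message too long" or 100% packet loss) while the 1500-byte probe succeeds, some hop on the path is not passing jumbo frames.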
LHCb
Fairly smooth operations, with little impact seen due to the current worldwide situation
Some sites understandably slower to respond/deal with issues but nothing significant
Currently running ~15K Folding@Home jobs on the HLT Farm
Current activity consists of the usual mix of MC production and user jobs.
Have ticketed Tier 2 sites to ask them about switching to CentOS 7. Most have this planned but need to wait until regular access returns.
Started to work on enabling the WLCG privacy notice for the central and experiment-specific services
Many services hosted by CERN have already drafted a CERN RoPO (Record of Processing Operations)
Though the content of a CERN RoPO is very much the same as that of the WLCG Privacy Notice, the scope and approval workflow are different
Need to better understand how to go about the approval; this will be brought to the WLCG MB this month.
Accounting TF
March accounting reports generated by CRIC were sent around for both T1 and T2 sites. The April reports, generated in May, are planned to be the last ones produced by the EGI portal. Starting from the May reports (generated in June), the CRIC reports will become official
Changes in the accounting reports generated by CRIC vs EGI reports
Instead of T1 storage accounting data (disk and tape) being manually injected into REBUS, WSSA data is used
Disk storage accounting is also available for T2 sites
The long-standing issue with DESY in the T2 reports has been fixed
All accounting data generated by APEL or WSSA is being validated by the sites; validated data is used for the reports
Migration of REBUS to CRIC is progressing according to schedule. REBUS has been in read-only mode since the beginning of April. The plan is to retire REBUS at the beginning of June
Network Throughput WG
Working on a new LHCONE mesh that will focus on testing from sites to R&E endpoints
Met with the perfSONAR developers this week about publishing measurements to the message bus directly from the perfSONAR toolkit - different options and a possible strategy going forward were discussed (one option is sketched at the end of this report)
ESnet (router) traffic feed now available; working on its integration into our pipeline - a prototype is already working
Also started working on the integration of the (network-related) OSG HTCondor job statistics - they will be added to our pipeline and stream
WLCG Network Throughput Support Unit: see twiki for summary of recent activities.
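As an illustration of one of the options that could be discussed in this context (an assumption, not the agreed strategy): measurements could be pushed to an ActiveMQ-style broker over STOMP. A minimal Python sketch using the stomp.py library, where the broker host, port, credentials, topic and sample document are all placeholders:

    # Sketch of one possible option (placeholders throughout), not the agreed solution:
    # publish a perfSONAR result document to a message broker over STOMP
    import json
    import stomp

    # Hypothetical sample document; real payloads would come from the toolkit
    result = {
        "meta": {"source": "ps.site-a.example.org",
                 "destination": "ps.site-b.example.org"},
        "event-type": "packet-loss-rate",
        "value": 0.0,
    }

    conn = stomp.Connection([("broker.example.org", 61613)])  # placeholder broker
    conn.connect("user", "password", wait=True)               # placeholder credentials
    conn.send(destination="/topic/perfsonar.raw",             # placeholder topic
              body=json.dumps(result),
              content_type="application/json")
    conn.disconnect()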