local: Andrea Manzi (CERN), Fabrizio Furano (CERN), Julia Andreeva (CERN), Mayank Sharma (CERN)
remote: Adrien, Alessandra Doria (Napoli), Baptiste Grenier (EGI), Christoph Wissing (CMS), Di Qing (TRIUMF), Dmitry Golubkov (LHCb), Eric Fede (IN2P3-CC), Felix Lee (ASGC), Giuseppe Bagliesi (CMS), Guillaume Philippon (GRIF), Jerome Pansanel (Strasbourg), Johannes Elmheuser (ATLAS), Jose Flix Molina (PIC), Laurent Duflot (IN2P3/IRFU), Matthew Viljoen (EGI), Puneet Patel (TIFR), Ron Trompert (NLT1), Stephan Lammel (CMS), Thomas Hartmann (DESY)
apologies: Catherine Biscarat
Operations News
The site survey combining storage- and compute-related questions has been sent around. If anyone has problems accessing the form, please contact Julia. We plan to close the survey after the 15th of June. We need the site answers for long-term planning of the WLCG service and to help sites with the migration from CREAM CE, which should happen before the end of 2020.
The next meeting is planned for June 6
Please let us know if that date would be very inconvenient
Baptiste and Julia: EGI and WLCG will work together for preparing documentation and tracking the progress of various sites.
Matthew suggested the creation of a mailing list for sites to get in touch and for sharing outcomes of the task force.
Julia: Next week we will agree on membership for the task force, start the real work on the migrations, and create a TWiki page to collect all necessary information.
DPM upgrade task force status report
Overall, the upgrade campaign is going very well. Sites that have enabled DOME report a good experience, and those using gridftp2 give good feedback as well. The TWiki lists the known issues and problems; the open issues will be closed in the next minor release, which will likely come after the DPM workshop.
Alessandra Doria confirmed that things have been stable in Napoli since the upgrade; the issues they faced were fixed in the latest release. Napoli recommends looping in more sites, as the latest release is in good shape.
ATLAS reported data deletion issues with DPM at the last meeting. An Apache configuration patch by Petr seems to help mitigate this; at this stage, a combination of upgrading to the latest DPM DOME flavour and applying Petr's Apache configuration looks like the likely solution. The issue does not appear to be related to the DPM version or its configuration (it affects any DPM version). Johannes was not aware of this solution. There are 30-40 sites that suffer from this issue in a round-robin fashion. Johannes suggested it would be helpful if the DPM team kept sites in the loop about such potential solutions. There will be a follow-up with sites at the WLCG level to inform them of the solution and the recommendation from the DPM team.
Discussion with EGI on the impact of the no-SRM solutions on the EGI operations
Julia to EGI: Q6) An important piece of SRM functionality was space queries. If SRM is not an option for the future, what is foreseen as a replacement for space queries?
Baptiste: Need to discuss with colleagues and get back.
Julia: SRM will no longer be a generic solution. We also have storage systems that do not support SRM (EOS, for instance). We need to think about an alternative for space queries, data access/removal and the other affected functionality, and examine this seriously. This will impact WLCG and EGI sites alike, so we should work on it together. Feel free to send email to WLCG operations to take this discussion further.
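One candidate for replacing SRM space queries on HTTP-capable storage is the WebDAV quota properties defined in RFC 4331 (`quota-available-bytes`, `quota-used-bytes`), which can be retrieved with a PROPFIND request. The sketch below is illustrative only: the endpoint path and the sample 207 Multi-Status reply are invented, and real storage systems may or may not expose these properties.

```python
# Hypothetical sketch: querying free/used space via WebDAV (RFC 4331)
# quota properties instead of an SRM space query. The sample reply is
# an invented 207 Multi-Status response; real endpoints will differ.
import xml.etree.ElementTree as ET

DAV = "{DAV:}"  # Clark notation for the DAV: XML namespace

# PROPFIND request body asking for the two RFC 4331 quota properties.
PROPFIND_BODY = """<?xml version="1.0" encoding="utf-8"?>
<D:propfind xmlns:D="DAV:">
  <D:prop>
    <D:quota-available-bytes/>
    <D:quota-used-bytes/>
  </D:prop>
</D:propfind>"""

# Invented example of a server reply for an illustrative space-token path.
SAMPLE_RESPONSE = """<?xml version="1.0" encoding="utf-8"?>
<D:multistatus xmlns:D="DAV:">
  <D:response>
    <D:href>/dpm/example.org/home/atlas/</D:href>
    <D:propstat>
      <D:prop>
        <D:quota-available-bytes>1099511627776</D:quota-available-bytes>
        <D:quota-used-bytes>549755813888</D:quota-used-bytes>
      </D:prop>
      <D:status>HTTP/1.1 200 OK</D:status>
    </D:propstat>
  </D:response>
</D:multistatus>"""

def parse_quota(xml_text):
    """Extract (available, used) bytes from a PROPFIND multistatus reply."""
    root = ET.fromstring(xml_text)
    avail = int(root.findtext(f".//{DAV}quota-available-bytes"))
    used = int(root.findtext(f".//{DAV}quota-used-bytes"))
    return avail, used

avail, used = parse_quota(SAMPLE_RESPONSE)
print(f"available: {avail} bytes, used: {used} bytes")
```

In practice the PROPFIND body would be sent with `Depth: 0` to the storage endpoint over HTTPS; only the response parsing is exercised here.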
Security updates for LCGDM also end on May 31st 2019. This could be too soon for some EGI sites to upgrade; however, the DPM team stresses that it is important to stick to the upgrade plans. This will be discussed further offline.
The yearly ALICE T1-T2 workshop was held May 14-16, this time at the Polytechnical University of Bucharest
ATLAS
Smooth Grid production over the last weeks with ~300k concurrently running grid job slots with the usual mix of MC generation, simulation, reconstruction, derivation production and analysis and a small fraction of dedicated data reprocessing. Some periods of additional HPC contributions with peaks of ~500k concurrently running job slots and ~15k jobs from Boinc. The HLT farm/Sim@P1 is back into production and adds in its current configuration ~80k job slots in addition.
Commissioning of a new PanDA worker node pilot version is on-going. We are slowly rolling the new version out to the sites at low scale.
Unchanged situation of DPM deletion problems as reported in the last meeting. In contact with DPM developers, but the current support model does not seem adequate at the scale that DPM is used.
Discussion
Julia asked the other experiments how they follow sites in the CentOS7 migration. CMS and LHCb are using Singularity and are therefore not much concerned about this migration, nor is ALICE. Stephan Lammel commented that for CMS, the key point in enabling Singularity was good documentation with clear instructions in a single place; this is not really relevant for the OS migration itself. The conclusion of the discussion was that apparently there is no option other than submitting GGUS tickets to the sites that delay the migration.
Added 2019-06-07: LHCb will be using Singularity, see here.
CMS
smooth running, compute systems busy at about 250k cores
usual production/analysis mix (80%/20%)
first (of several) periods of parked B-physics data being processed
heavy ion re-reconstruction in progress, about 40% done
2017 and 2018 Monte Carlo production ongoing
disk deletion (ahead of tape deletion) campaign complete
The CRR format is practically agreed. The CRIC developers have consumed the CRRs prototyped by the pioneer sites.
During the EGI conference we discussed how to make sure that, in case some sites stop using BDII, EGI can get the information required for operations from CRIC. The agreed plan is to follow the same data flow as for publishing cloud information in EGI via the message bus. We will prototype this data flow with our EGI colleagues.
CMS CRIC has been in production and in use by CMS for several months. We have now agreed with CMS on a plan to retire SiteDB. The first step is to put it in read-only mode, which will happen next week. If everything goes smoothly after one month, SiteDB can be stopped.
At CMS's request, quarterly pledges have been prototyped in the wlcg-cric. We plan to give a demo of the WLCG CRIC at the next GDB in June.
Issues with the psmad dashboard were fixed; the dashboard is now well populated (the OPN, UK and FR meshes are in very good shape; psmad/maddash)
http://monit-grafana-open.cern.ch is also now well populated; some issues with site mapping due to IPv6 were fixed, while others still remain (mostly due to too many sources/complex topology processing)
The new collector is now in production; it was re-written from scratch within the SAND project and has improved performance (lower latency)
Work is on-going in both SAND and IRIS-HEP to switch all perfSONAR to report measurements directly to the message bus (real-time measurements capability)
WLCG Network Throughput Support Unit: see twiki for summary of recent activities.
100 Gbps perfSONARs are now deployed at SARA, CERN, CSCS and BNL (80 Gbps); KIT is in QA
IP ranges from GOCDB are now used by wlcg-wpad to disambiguate sites that share GeoIP organizations but have different squids. Only a few cases have been affected so far, but many sites have no IP ranges registered, or have 0.0.0.0. The plan is to wait until someone has a problem and then ask the site admins to register their IP ranges. We should probably add Kibana monitoring of wlcg-wpad usage and watch for cases where people attempt to use it and fail.
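The disambiguation step described above can be sketched roughly as follows: match the client IP against the CIDR ranges registered per site, skipping placeholder 0.0.0.0 registrations, and map the winning site to its squid. All site names, ranges and squid URLs here are invented for illustration; this is not the actual wlcg-wpad code.

```python
# Hypothetical sketch of wlcg-wpad's IP-range disambiguation: given
# several sites that share the same GeoIP organization, pick the squid
# of the site whose registered GOCDB IP range contains the client IP.
# Site names, ranges and squid URLs are invented for illustration.
import ipaddress

# IP ranges as they might be registered in GOCDB, per site.
SITE_RANGES = {
    "SITE-A": ["192.0.2.0/24"],
    "SITE-B": ["198.51.100.0/24", "203.0.113.0/25"],
    "SITE-C": ["0.0.0.0"],  # placeholder registration, treated as missing
}

SITE_SQUIDS = {
    "SITE-A": "http://squid-a.example.org:3128",
    "SITE-B": "http://squid-b.example.org:3128",
}

def site_for_ip(client_ip, site_ranges=SITE_RANGES):
    """Return the site whose registered range contains client_ip, or None."""
    addr = ipaddress.ip_address(client_ip)
    for site, ranges in site_ranges.items():
        for cidr in ranges:
            if cidr == "0.0.0.0":  # skip unusable registrations
                continue
            if addr in ipaddress.ip_network(cidr):
                return site
    return None

def squid_for_ip(client_ip):
    """Look up the squid for a client IP; None means unregistered/ambiguous."""
    return SITE_SQUIDS.get(site_for_ip(client_ip))

print(squid_for_ip("198.51.100.17"))  # falls inside SITE-B's range
print(squid_for_ip("10.0.0.1"))       # no registered range: None
```

The `None` fallback corresponds to the cases mentioned above where a site has no usable IP ranges registered, which is where the proposed Kibana monitoring would help spot failed lookups.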