local: Andrea Manzi (CERN), Fabrizio Furano (CERN), Julia Andreeva (CERN), Mayank Sharma (CERN)
remote: Adrien, Alessandra Doria (Napoli), Baptiste Grenier (EGI), Christoph Wissing (CMS), Di Qing (TRIUMF), Dmitry Golubkov (LHCb), Eric Fede (IN2P3-CC), Felix Lee (ASGC), Giuseppe Bagliesi (CMS), Guillaume Philippon (GRIF), Jerome Pansanel (Strasbourg), Johannes Elmheuser (ATLAS), Jose Flix Molina (PIC), Laurent Duflot (IN2P3/IRFU), Matthew Viljoen (EGI), Puneet Patel (TIFR), Ron Trompert (NLT1), Stephan Lammel (CMS), Thomas Hartmann (DESY)
apologies: Catherine Biscarat
Operations News
The site survey combining storage- and compute-related questions has been sent around. If anyone has problems accessing the form, please contact Julia. We plan to close the survey after the 15th of June. We need the site answers for long-term planning of the WLCG service and to help sites with the migration from CREAM CE, which should happen before the end of 2020.
The next meeting is planned for June 6
Please let us know if that date would be very inconvenient
Baptiste and Julia: EGI and WLCG will work together for preparing documentation and tracking the progress of various sites.
Matthew suggested the creation of a mailing list for sites to get in touch and for sharing outcomes of the task force.
Julia: Next week we will agree on membership for the task force, start the real work on the migrations, and create a TWiki page to collect all necessary information.
DPM upgrade task force status report
Overall, the upgrade campaign is going very well. Sites that have enabled DOME report a good experience, and those using gridftp2 give good feedback as well. The TWiki lists the known issues and problems; the open issues will be closed in the next minor release, which will likely come after the DPM workshop.
Alessandra Doria confirmed that things have been stable in Napoli since the upgrade; the issues they faced were fixed in the latest release. Napoli recommends looping in more sites, as the latest release is in good shape.
ATLAS reported data deletion issues with DPM at the last meeting. An Apache configuration patch by Petr seems to help mitigate this; at this stage, a combination of upgrading to the latest DPM DOME flavour and applying Petr's Apache configuration looks like the likely solution. The issue does not appear to be related to the DPM version or its configuration (it affects any DPM version). Johannes was not aware of this solution. There are 30-40 sites that suffer from this issue in a round-robin fashion. Johannes suggested it would be helpful if the DPM team kept sites in the loop about such potential solutions. There will be a follow-up with sites at the WLCG level to inform them of the solution and the recommendation from the DPM team.
Discussion with EGI on the impact of the no-SRM solutions on the EGI operations
Julia to EGI: Q6) An important piece of SRM functionality was space queries. If SRM is not an option for the future, what is foreseen as a replacement for space queries?
Baptiste: Need to discuss with colleagues and get back.
Julia: SRM will no longer be a generic solution. We also have storage systems that do not support SRM (EOS, for instance). We need to think about an alternative for space queries, data access/removal and the other affected functionality, and examine this seriously. This will impact WLCG and EGI sites alike, so we should work on it together. Feel free to send email to WLCG operations to take this discussion further.
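One candidate for replacing SRM space queries on HTTP-capable storage is the WebDAV quota properties defined in RFC 4331 (`quota-available-bytes`, `quota-used-bytes`), which can be retrieved with a PROPFIND request. The sketch below is illustrative only: the endpoint path and the sample 207 Multi-Status reply are invented, and real storage systems may or may not expose these properties.

```python
# Hypothetical sketch: querying free/used space via WebDAV (RFC 4331)
# quota properties instead of an SRM space query. The sample reply is
# an invented 207 Multi-Status response; real endpoints will differ.
import xml.etree.ElementTree as ET

DAV = "{DAV:}"  # Clark notation for the DAV: XML namespace

# PROPFIND request body asking for the two RFC 4331 quota properties.
PROPFIND_BODY = """<?xml version="1.0" encoding="utf-8"?>
<D:propfind xmlns:D="DAV:">
  <D:prop>
    <D:quota-available-bytes/>
    <D:quota-used-bytes/>
  </D:prop>
</D:propfind>"""

# Invented example of a server reply for an illustrative space-token path.
SAMPLE_RESPONSE = """<?xml version="1.0" encoding="utf-8"?>
<D:multistatus xmlns:D="DAV:">
  <D:response>
    <D:href>/dpm/example.org/home/atlas/</D:href>
    <D:propstat>
      <D:prop>
        <D:quota-available-bytes>1099511627776</D:quota-available-bytes>
        <D:quota-used-bytes>549755813888</D:quota-used-bytes>
      </D:prop>
      <D:status>HTTP/1.1 200 OK</D:status>
    </D:propstat>
  </D:response>
</D:multistatus>"""

def parse_quota(xml_text):
    """Extract (available, used) bytes from a PROPFIND multistatus reply."""
    root = ET.fromstring(xml_text)
    avail = int(root.findtext(f".//{DAV}quota-available-bytes"))
    used = int(root.findtext(f".//{DAV}quota-used-bytes"))
    return avail, used

avail, used = parse_quota(SAMPLE_RESPONSE)
print(f"available: {avail} bytes, used: {used} bytes")
```

In practice the PROPFIND body would be sent with `Depth: 0` to the storage endpoint over HTTPS; only the response parsing is exercised here.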
Security updates for LCGDM also end on May 31st 2019. This could be too soon for some EGI sites to upgrade; however, the DPM team stresses that it is important to stick to the upgrade plans. This will be discussed further offline.
The yearly ALICE T1-T2 workshop was held May 14-16, this time at the Polytechnical University of Bucharest
ATLAS
Smooth Grid production over the last weeks with ~300k concurrently running grid job slots with the usual mix of MC generation, simulation, reconstruction, derivation production and analysis and a small fraction of dedicated data reprocessing. Some periods of additional HPC contributions with peaks of ~500k concurrently running job slots and ~15k jobs from Boinc. The HLT farm/Sim@P1 is back into production and adds in its current configuration ~80k job slots in addition.
Commissioning of a new PanDA worker node pilot version is on-going. We are slowly rolling the new version out to the sites at low scale.
Unchanged situation of DPM deletion problems as reported in the last meeting. In contact with DPM developers, but the current support model does not seem adequate at the scale that DPM is used.
Discussion
Julia asked the other experiments how they follow sites in the CentOS7 migration. CMS and LHCb are using Singularity and are therefore not much concerned about this migration, nor is ALICE. Stephan Lammel commented that for CMS, the key point in enabling Singularity was good documentation with clear instructions in a single place; this is not really relevant for the OS migration itself. The conclusion of the discussion was that apparently there is no option other than submitting GGUS tickets to the sites that delay the migration.
Added 2019-06-07: LHCb will be using Singularity, see here.
CMS
smooth running, compute systems busy at about 250k cores
usual production/analysis mix (80%/20%)
first (of several) periods of parked B-physics data being processed
heavy ion re-reconstruction in progress, about 40% done
2017 and 2018 Monte Carlo production ongoing
disk deletion (ahead of tape deletion) campaign complete
The CRR format is practically agreed. The CRIC developers have consumed the CRRs prototyped by the pioneer sites.
During the EGI conference we discussed how to make sure that, in case some sites stop using BDII, EGI can get the information required for operations from CRIC. The agreed plan is to follow the same data flow as for publishing cloud information in EGI via the message bus. We will prototype this data flow with our EGI colleagues.
CMS CRIC has been in production and in use by CMS for several months. We have now agreed with CMS on a plan to retire SiteDB. The first step is to put it in read-only mode, which will happen next week. If everything goes smoothly after one month, SiteDB can be stopped.
At CMS's request, quarterly pledges have been prototyped in the wlcg-cric. We plan to give a demo of the WLCG CRIC at the next GDB in June.
Issues with the psmad dashboard were fixed; the dashboard is now well populated (the OPN, UK and FR meshes are in very good shape; psmad/maddash)
http://monit-grafana-open.cern.ch is also now well populated; some issues with site mapping due to IPv6 were fixed, while others still remain (mostly due to too many sources/complex topology processing)
The new collector is now in production; it was re-written from scratch within the SAND project and has improved performance (lower latency)
Work is on-going in both SAND and IRIS-HEP to switch all perfSONAR to report measurements directly to the message bus (real-time measurements capability)
WLCG Network Throughput Support Unit: see twiki for summary of recent activities.
100 Gbps perfSONARs are now deployed at SARA, CERN, CSCS and BNL (80 Gbps); KIT is in QA
IP ranges from GOCDB are now used by wlcg-wpad to disambiguate sites that share GeoIP organizations but have different squids. Only a few cases have been affected so far, but many sites have no IP ranges registered, or have 0.0.0.0. The plan is to wait until someone has a problem and then ask the site admins to register their IP ranges. We should probably add Kibana monitoring of wlcg-wpad usage and watch for cases where people attempt to use it and fail.
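The disambiguation step described above can be sketched roughly as follows: match the client IP against the CIDR ranges registered per site, skipping placeholder 0.0.0.0 registrations, and map the winning site to its squid. All site names, ranges and squid URLs here are invented for illustration; this is not the actual wlcg-wpad code.

```python
# Hypothetical sketch of wlcg-wpad's IP-range disambiguation: given
# several sites that share the same GeoIP organization, pick the squid
# of the site whose registered GOCDB IP range contains the client IP.
# Site names, ranges and squid URLs are invented for illustration.
import ipaddress

# IP ranges as they might be registered in GOCDB, per site.
SITE_RANGES = {
    "SITE-A": ["192.0.2.0/24"],
    "SITE-B": ["198.51.100.0/24", "203.0.113.0/25"],
    "SITE-C": ["0.0.0.0"],  # placeholder registration, treated as missing
}

SITE_SQUIDS = {
    "SITE-A": "http://squid-a.example.org:3128",
    "SITE-B": "http://squid-b.example.org:3128",
}

def site_for_ip(client_ip, site_ranges=SITE_RANGES):
    """Return the site whose registered range contains client_ip, or None."""
    addr = ipaddress.ip_address(client_ip)
    for site, ranges in site_ranges.items():
        for cidr in ranges:
            if cidr == "0.0.0.0":  # skip unusable registrations
                continue
            if addr in ipaddress.ip_network(cidr):
                return site
    return None

def squid_for_ip(client_ip):
    """Look up the squid for a client IP; None means unregistered/ambiguous."""
    return SITE_SQUIDS.get(site_for_ip(client_ip))

print(squid_for_ip("198.51.100.17"))  # falls inside SITE-B's range
print(squid_for_ip("10.0.0.1"))       # no registered range: None
```

The `None` fallback corresponds to the cases mentioned above where a site has no usable IP ranges registered, which is where the proposed Kibana monitoring would help spot failed lookups.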