CernVM-FS Coordination Meeting
Europe/Zurich
Zoom (CERN)
17:00 → 17:20
Round Table 20m
- Issue re: unresponsive CVMFS process (https://github.com/cvmfs/cvmfs/issues/3378)
- Rollout of CVMFS_CACHE_REFCOUNT=yes
CVMFS Coordination Meeting, 11-09-2023
Attending: Valentin Völkl, Carl Vuosalo, Dave Dykstra, Dennis van Dok, Laura, Thomas Birkett, John De Stefano Jr., Fabrizio Furano.
Valentin: We recently released 2.11. There were a couple of minor packaging hiccups for Debian. Our plan is to stage the roll-out of CVMFS_CACHE_REFCOUNT; we will make it the default with the next patch release. There are one or two more lines of investigation into the unresponsive CVMFS client, which is still being investigated. Dennis will give us access to a machine. One thing we will have to do is centralize the reports and try to find common themes. We developers don't have an authenticated repo that matches a data repo, so we need to set up a test repo.
Dave: Would you like to become a LIGO member to run tests?
Valentin: That would be welcome if you could add me.
Dave: I will get someone to add you; I will need your name and email address.
Valentin: That should be enough.
Dave: I will raise it on their Slack channel.
Thomas: I am stepping in for Jose this week. We've identified three repos that have hung. The common theme is that they are Stash Cache repos.
Dave: Which repos are those?
Thomas: sbn.osgstorage.org, ligo.storage.igwn.org, and ligo.osgstorage.org.
Dave: The LIGO ones are authenticated.
Thomas: The way the Fermilab VO is structured meant that we couldn't confirm which experiment was accessing the repo.
Dave: Who was reporting?
Thomas: Katy Ellis from RAL.
Dave: It seems to be the SBN experiment.
Thomas: She is the UK RAL Tier-1 liaison.
Dave: I think the common theme is that the Stash Cache becomes overloaded, or that it happens when they are pulling data into the Stash Cache servers. Valentin, you could reproduce this by deliberately overloading a server, but you need to avoid interfering with another experiment. There isn't much danger in spilling over to other experiments.
Valentin: Who should we talk to?
Thomas: Ken Herner about the Fermilab VO. David Mason (FNAL Tier-1 manager) has also been involved.
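The staged roll-out of CVMFS_CACHE_REFCOUNT that Valentin describes can be opted into per client ahead of it becoming the default; a minimal sketch, assuming the usual local override file:

```shell
# /etc/cvmfs/default.local -- opt in to the refcounted cache manager
# before it becomes the default in a later patch release
CVMFS_CACHE_REFCOUNT=yes
```

After editing, `cvmfs_config reload` applies the new parameters to mounted repositories.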
Valentin: Authentication is part of the problem here. We have to see if it is possible to construct a test with authentication. This hanging failure is not what we want to have; we can change this.
Laura: Have you tried changing the timeouts in the clients?
Thomas: We did tweak a couple of variables, including timeouts, but that didn't help. At the moment, Fermilab and LIGO are not able to submit to our site. The failure causes the loss of a node, so we can't do testing without affecting other experiments.
Valentin: Ideally, we want to have an environment where we can test this out.
Dennis: We are seeing different issues from different systems. The Stash Cache issue caused switch-overs. There is a different failure mode with other LIGO jobs. I can't find a message about switch-overs; I am unable to find it.
Dave: You saw a failure but...
Dennis: Sorry, I was mistaken. It is the same type of failure; I do see it now. Was there any parameter you could set that affects the behavior? We are on 2.10.
Valentin: We recommend updating in any case.
Dennis: We will upgrade soon. I'll keep the machine around to investigate closer. It would affect other VOs. We haven't yet banned LIGO, but we have failures daily.
Dave: Have you told LIGO?
Dennis: Not directly. Not sure.
Dave: We need to reach the right people in LIGO so they can dial back how hard they are hitting sites.
Dennis: LIGO has been running lots of jobs successfully in previous months.
Dave: They could be overloading the origin server.
Thomas: We had the issue with LIGO in November and they had to be suspended. We fixed it by updating the LIGO VOMS certificate. That made the problem go away until recently.
Dave: It could be due to the load on the origin server; they might not be able to watch for it. The origin server is a storage repo. Only metadata is sent through the normal CVMFS mechanism; the actual data is on the external servers, at external URLs.
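The client-side timeout variables Thomas mentions tweaking are normally set in the local client configuration; an illustrative sketch (the variable names are standard CVMFS client parameters, but the values here are made up, not the ones RAL used):

```shell
# /etc/cvmfs/default.local -- example timeout/retry tuning (illustrative values)
CVMFS_TIMEOUT=10          # seconds per request when going through a proxy
CVMFS_TIMEOUT_DIRECT=30   # seconds per request when connecting directly
CVMFS_MAX_RETRIES=2       # download retries before failing over to the next host
```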
The CVMFS repositories show how to find the data on the external servers, and the CVMFS configuration finds the nearest four servers. Stash Caches are similar to squids, except that they have faster and bigger disk systems and are only at big sites. If a requested item is not cached, they have to go back to the origin servers over XRootD. The connections between CVMFS and a Stash Cache server use HTTP, and the caches then use the XRootD protocol to read from the origin server. The total bandwidth around the world is higher, but they all read from the same origin server, so it is not hard to overload it, because there are jobs all over the world reading from them.
Dennis: We see an XCache from Barcelona, and another from Caltech.
Dave: Caltech is surprising.
Dennis: If it is not them causing problems, maybe we shouldn't worry about them.
Dave: It should be closer. The reason I suspect the origin server is that you have the same problem from all four Stash Caches that read from the same origin server. They might need to scale down their jobs. If jobs are asking for different data, a cache won't help.
Dennis: The result is that CVMFS waits for data and everything else backs up behind it.
Valentin: There should be a different failure mode for this.
Dave: Stash Cache is the old name; the new name is OSDF, the Open Science Data Federation.
Dennis: Are there other VOs that use this?
Dave: LIGO is big and authenticated. There are quite a few at Fermilab that use the OSDF system unauthenticated. Eventually, there may be EGI experiments that want to use it. It is heavily used inside OSG.
Dennis: So none of the European experiments. LIGO is worldwide. What is storage.igwn.org?
Dave: It's new in the last month or two. LIGO's code had been on oasis.opensciencegrid.org, but they needed their own code repo because it was so big; now it's software.igwn.org. They also wanted two separate storage repositories for LIGO and Virgo. It is still hosted by OSG, and the same configurations are used.
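The metadata/payload split Dave describes is driven by the client's external-data settings; a hedged sketch with placeholder hostnames (in practice the OSDF cache lists come from the distributed configuration repositories, not from hand-edited files like this):

```shell
# illustrative only: file contents marked "external" in the repository are
# fetched from these cache servers, while catalogs and metadata still come
# through the normal CVMFS stratum 1 / proxy chain
CVMFS_EXTERNAL_URL="http://cache1.example.org:8000/;http://cache2.example.org:8000/"
```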
It's a symbolic link to osgstorage.org. I had offered to request that you become a LIGO member, but it may not be necessary; it sounds like I can make you a LIGO member. But it may be a coincidence. There may be a simpler case to reproduce the problem at Fermilab, overloading an origin server without bothering with authentication.
Fabrizio: I wanted to ask what the general feeling is about still supporting "/opt" accesses to repositories, which we dropped after the refurbishment of the infrastructure at CERN. That was a surprise; it is not dead. Fixing the SAM test is not so difficult. Is this the right moment? Jakob and Valentin said they don't remember how many years ago it was deprecated. The time to set a date is now.
Dave: Long ago in CVMFS there were CERN and OSG repo names without any domain names, and those Stratum 1s support a name like /opt. I checked and found that 15 or 20 squid sites use the /opt format. Part of it is the CMS SAM test, but the test looks up the existing configuration of the server and tries each one. All these years later, many are still using this. Someone needs to coordinate and tell site administrators; I suggest the squid ops team could do this. We can discuss it with the squid ops team during the Frontier ops meeting.
Fabrizio: I will contact WLCG folks and ask for a campaign.
Dave: We could also work through squid operations.
Fabrizio: Maybe both.
Dave: I'm on the mailing list for WLCG operations.
Fabrizio: I will work on it tomorrow.
Carl: How do sites fix the CVMFS configuration?
Dave: Delete the old configuration files in /etc/cvmfs that override the correct configuration.
Fabrizio: Should there be a test program to check for a correct configuration?
Dave: The Squid Ops team could help sites through their CVMFS problems. They can be reached at wlcg-squid-ops@cern.ch. We have two people, one for ATLAS and one for CMS; non-CMS and non-ATLAS issues are split geographically. They report to the Frontier Ops meeting.
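Dave's advice to delete old overriding files in /etc/cvmfs can be approximated by looking for local configuration that still pins repository lists or server URLs; a diagnostic sketch (the paths are the conventional client config locations, and the search is only a heuristic):

```shell
# look for locally pinned repository lists and old URL overrides that may
# predate domain-style repository names such as cms.cern.ch
grep -Hn -e 'CVMFS_REPOSITORIES' -e 'CVMFS_SERVER_URL' \
    /etc/cvmfs/default.local /etc/cvmfs/config.d/*.conf 2>/dev/null
```

Anything this turns up should be compared against the configuration shipped in the cvmfs-config packages before deleting it.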
Valentin: Thanks to Fabrizio for starting the campaign to fix the obsolete configs.
Dave: I will be at CERN for the CMS Offline and Computing week around October 2nd, for the first time since the pandemic.
John: During bootstrap replication, there was an issue with the replication tools. "check -r" is very useful but very slow, and the end of the run doesn't describe the problems, it just gives counts of them. Since it doesn't tell you what the problems were, that output is not useful; it needs to summarize the errors at the end.
Valentin: The errors are mixed into the very large output, so it would be good to list them again at the end. I can create the issue.
John: Thanks.
Dave: The work-around is to send the output to a file and search through it. I run cvmfs_check all the time from cron. The output file would be the same size.
Valentin: On the topic of replication, the biggest repos show deep recursion, and if the stack limit is too small, there can be a segfault.
John: The fix is to increase the file descriptor limit.
Valentin: It's actually the stack size.
John: It was crashing every time until I increased the file descriptor limit.
Valentin: With big repos, you might run into limits. We could also put fixes in the patch release. Anything else?
Laura: The server check utility should be faster.
Valentin: It is faster in 2.10.1; duplicate checks are eliminated.
Dave: It took two months to check unpacked.cern.ch before this fix.
Valentin: It would be good to have a meeting when Dave is at CERN. October 9th is not good for a meeting; we could meet the week before or after, either October 2nd or 16th.
John: I prefer the 2nd.
Dave: October 2nd would work, and I'll be at CERN.
Valentin: Next meeting is October 2nd, with a meeting room at CERN. Bye.
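The work-around Dave describes, together with the stack-size limit Valentin points out, can be sketched as shell commands (the "check -r" flag and the unpacked.cern.ch repository are taken from the minutes; the log path and grep pattern are illustrative):

```shell
# raise the stack limit first: deep recursion on very large repositories
# can otherwise segfault (it is the stack size, not the fd limit)
ulimit -s unlimited

# capture the very large output to a file, then pull the error lines back out
cvmfs_server check -r unpacked.cern.ch > /tmp/cvmfs_check.log 2>&1
grep -iE 'error|fail' /tmp/cvmfs_check.log
```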