CernVM-FS Coordination Meeting

Europe/Zurich
Zoom (CERN)

    • 17:00–17:20
      Round Table 20m
      • Issue re: unresponsive CVMFS process (https://github.com/cvmfs/cvmfs/issues/3378)
      • Rollout of CVMFS_CACHE_REFCOUNT=yes
      CVMFS Coordination Meeting 11-09-2023
      
      Attending: Valentin Volki, Carl Vuosalo, Dave Dykstra, Dennis van Dok, Laura,
      Thomas Birkett, John De Stefano Jr., Fabrizio Furano.
      
      Valentin: We have released 2.11 recently. There have been a couple of minor packaging
      hiccups for Debian. Our plan is to stage the roll-out of CVMFS_CACHE_REFCOUNT. We will
      make it the default with the next patch release.
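
      As an illustration, a site that wants to opt in before the flag becomes the default could
      set it in its local client configuration. A minimal sketch, assuming the standard local
      override file and a stock client installation:

          # /etc/cvmfs/default.local -- local client overrides (sketch)
          CVMFS_CACHE_REFCOUNT=yes

          # pick up the change without remounting
          cvmfs_config reload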
      
      There are one or two more lines of investigation into the unresponsive
      CVMFS client; it is still under investigation. Dennis will give us access
      to a machine.
      One thing we will have to do is centralize the reports and try to find common themes.
      We developers don't have an authenticated repo that matches a data repo,
      so we need to set up a test repo.
      
      Dave: Would you like to become a LIGO member to run tests?
      
      Valentin: That would be welcome if you could add me.
      
      Dave: I will get someone to add you. I will need your name and email address.
      
      Valentin: That should be enough.
      
      Dave: I will raise it on their Slack channel.
      
      Thomas: I am stepping in for Jose this week. We've identified three repos that
      have hung. The common theme is that they are Stash Cache repos.
      
      Dave: Which repos are those?
      
      Thomas: sbn.osgstorage.org, ligo.storage.igwn.org, ligo.osgstorage.org
      
      Dave: The LIGO ones are authenticated.
      
      Thomas: The way that the Fermilab VO is structured meant that we couldn't confirm
      which experiment was accessing the repo.
      
      Dave: Who was reporting?
      
      Thomas: Katy Ellis from RAL.
      
      Dave: It seems to be the SBN experiment.
      
      Thomas: She is the UK RAL Tier-1 liaison.
      
      Dave: I think the common theme is that the Stash Cache becomes overloaded, or
      that they are pulling data into the Stash Cache servers. Valentin, could you
      reproduce this by deliberately overloading a server? You would need to avoid
      interfering with other experiments, although there isn't much danger of spilling
      over to them.
      
      Valentin: Who should I talk to?
      
      Thomas: Ken Herner about Fermilab VO. Also David Mason (FNAL T1 manager) has
      been involved.
      
      Valentin: Authentication is part of the problem here. We have to see if it is
      possible to construct a test with authentication.
      
      Valentin: This hanging failure is not what we want to have. We can change this.
      
      Laura: Have you tried changing the timeouts in the clients?
      
      Thomas: We did tweak a couple of variables, including timeouts, but that didn't help.
      At the moment, Fermilab and LIGO are not able to submit to our site.
      The failure causes loss of a node, so we can't do testing without affecting other
      experiments.
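
      For reference, the client timeout and retry knobs a site might adjust in this situation
      look roughly like the following. This is a hedged sketch only; the minutes do not record
      which variables were actually changed, and the values are arbitrary examples:

          # /etc/cvmfs/default.local -- illustrative values, not a recommendation
          CVMFS_TIMEOUT=10          # seconds for requests that go through a proxy
          CVMFS_TIMEOUT_DIRECT=30   # seconds for direct connections
          CVMFS_MAX_RETRIES=2       # download retries before a request fails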
      
      Valentin: Ideally, we want to have an environment where we can test this out.
      
      Dennis: We are seeing different issues from different systems. The Stash Cache issue
      caused switch-overs. There is a different failure mode with other LIGO jobs.
      I can't find a message about the switch-overs at the moment.
      
      Dave: You saw a failure but... 
      
      Dennis: Sorry, I was mistaken. It is the same type of failure. I do see it now.
      Was there any parameter you could set that affects the behavior? We are on 2.10.
      
      Valentin: We recommend updating in any case.
      
      Dennis: We will upgrade soon. I'll keep the machine around to investigate more closely.
      The failures would affect other VOs. We haven't banned LIGO yet, but we have failures daily.
      
      Dave: Have you told LIGO?
      
      Dennis: Not directly. Not sure.
      
      Dave: We need to reach the right people in LIGO so they can dial back how hard they are
      hitting sites.
      
      Dennis: LIGO has been running lots of jobs successfully in previous months.
      
      Dave: They could be overloading the origin server. 
      
      Thomas: We had the issue with LIGO in November and they had to be suspended.
      We fixed the issue by updating the LIGO VOMS certificate. That made the problem go 
      away until recently.
      
      Dave: It could be due to the load on the origin server. They might not be able to watch
      for it.
      
      The origin server is a storage repo. Only metadata is sent through the normal CVMFS
      mechanism. The actual data live on the external servers, at external URLs. The CVMFS
      repositories describe how to find each file on the external servers, and the CVMFS
      configuration picks the nearest four servers. Stash Caches are similar to Squids, except
      that they have faster and bigger disk systems and are only at big sites. If a requested
      item is not cached, they have to go back to the origin servers over XRootD. The
      connections between CVMFS and a Stash Cache server use HTTP; the cache then uses the
      XRootD protocol to read from the origin server.
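
      As a sketch of what this looks like on the client side: a repository published with
      external data serves only catalogs through the normal Stratum 1 chain, and the client is
      pointed at the external cache servers via CVMFS_EXTERNAL_URL. The repository name and
      hosts below are placeholders, not the real OSDF configuration:

          # /etc/cvmfs/config.d/example.osgstorage.org.conf -- hypothetical repo and hosts
          # catalogs/metadata come from the Stratum 1s; file data is fetched from the
          # external (Stash Cache / OSDF) servers listed here
          CVMFS_EXTERNAL_URL="http://cache1.example.org:8000;http://cache2.example.org:8000"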
      
      The total bandwidth demand, summed around the world, is high, and all the caches read
      from the same origin server. It is not hard to overload the origin servers because there
      are jobs all over the world reading from them.
      
      Dennis: XCache from Barcelona. And another from Caltech.
      
      Dave: It is surprising from Caltech.
      
      Dennis: If it is not them causing problems, maybe we shouldn't worry about them.
      
      Dave: It should be closer. The reason why I suspect the origin server is that you have
      the same problem from all four Stash Caches that read from the same origin
      server. They might need to scale down their jobs. If jobs are asking for different
      data, a cache won't help.
      
      Dennis: The result is that CVMFS waits for data and everything else backs up behind it.
      
      Valentin: There should be a different failure mode for this.
      
      Dave: Stash Cache is the old name. New name is OSDF. Open Science Data Federation. 
      
      Dennis: Are there other VOs that use this?
      
      Dave: LIGO is big and authenticated. There are quite a few at Fermilab that use the
      OSDF system unauthenticated. Eventually, there may be EGI experiments that want to
      use this. It is heavily used inside OSG.
      
      Dennis: None of the European experiments use it. LIGO is worldwide. What is storage.igwn.org?
      
      Dave: It's new in the last month or two. LIGO's code had been on
      oasis.opensciencegrid.org, but they needed their own code repo because it was so big.
      Now it's software.igwn.org. They also wanted two separate storage repositories for LIGO
      and Virgo.  It is still hosted by OSG. The same configurations are used. It's a symbolic
      link to osgstorage.org.
      
      I had offered to request that you become a LIGO member; it sounds like I can make you one,
      but it may not be necessary. The LIGO connection may be coincidence, and there may be a
      simpler way to reproduce the problem at Fermilab by overloading an origin server without
      bothering with authentication.
      
      Fabrizio: I wanted to ask what the general feeling is about still supporting "/opt"-style
      accesses to repositories, which we dropped after the refurbishment of the infrastructure
      at CERN. It was a surprise that this usage is not dead. Fixing the SAM test is not so
      difficult. Is this the right moment? Jakob and Valentin said they don't remember how many
      years ago it was deprecated. The time to set a date is now.
      
      Dave: Long ago in CVMFS there were CERN and OSG repo names without any domain names, and
      the Stratum 1s support names like /opt. I checked and found that 15 or 20 squid sites use
      the /opt format. Part of it is the CMS SAM test, but the test looks up the existing
      configuration of the server and tries each one. All these years later, many sites are
      still using this. Someone needs to coordinate and tell site administrators. I suggest the
      squid ops team could do this. We can discuss it with them during the Frontier ops meeting.
      
      Fabrizio: I will contact WLCG folks and ask for a campaign.
      
      Dave: We could also work through squid operations.
      
      Fabrizio: Maybe both.
      
      Dave: I'm on the mailing list for WLCG operations.
      
      Fabrizio: I will work on it tomorrow.
      
      Carl: How to fix the CVMFS configuration?
      
      Dave: Delete the old configuration files in /etc/cvmfs that override the correct
      configuration.
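
      A minimal sketch of how a site administrator might look for such leftovers, assuming the
      stale settings live in the usual local override locations and using the standard client
      utilities (the repository name is only an example):

          # list local override files that can shadow the distributed defaults
          ls /etc/cvmfs/default.local /etc/cvmfs/config.d/ /etc/cvmfs/domain.d/ 2>/dev/null

          # print the effective configuration, and where each value comes from, for one repo
          cvmfs_config showconfig cms.cern.ch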
      
      Fabrizio: Should there be a test program to test for a correct configuration?
      
      Dave: The Squid Ops team could help sites through their CVMFS problems.
      They can be reached at wlcg-squid-ops@cern.ch.
      We have two people, one for ATLAS and one for CMS. Issues that are neither ATLAS nor CMS
      are split geographically. They report to the Frontier Ops meeting.
      
      
      Valentin: Thanks to Fabrizio for starting the campaign to fix the obsolete configs.
      
      Dave: I will be at CERN for CMS Offline and Computing week around October 2nd 
      for first time since the pandemic.
      
      John: During bootstrap replication, there was an issue with the replication tools.
      "check -r" is very useful but very slow. When it runs, the end of the run doesn't
      indicate what the problems were; it just gives counts of them, which is not useful.
      It needs to summarize the errors at the end.
      
      Valentin: The errors are mixed into the very large output, so it would be good to list
      them again at the end. I can create the issue.
      
      John: Thanks.
      
      Dave: The work-around is to send the output to a file and search through it.
      I run cvmfs_check all the time from cron. The output file would be the same size.
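
      A sketch of that work-around, assuming the command being discussed is the server-side
      check (cvmfs_server check -r) and using a placeholder repository name and log path:

          # capture the full output, then pull the error lines out afterwards
          cvmfs_server check -r example.cern.ch > /tmp/check-example.log 2>&1
          grep -iE 'error|fail' /tmp/check-example.log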
      
      Valentin: On the topic of replication, the biggest repos involve deep recursion, and if
      the stack size limit is too small there can be a segfault.
      
      John: The fix is to increase the file descriptor limit.
      
      Valentin: It's actually the stack size.
      
      John: It was crashing every time until I increased file descriptor limit.
      
      Valentin: With big repos, you might run into limits. We could also put fixes in the
      patch release.
      Anything else?
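
      For illustration, both limits mentioned above can be inspected and raised from the shell
      before running the replication; the values below are arbitrary examples, not
      recommendations:

          ulimit -s            # show the current stack size limit (in KB)
          ulimit -s 65536      # raise it for this shell, e.g. to 64 MB
          ulimit -n            # show the open file descriptor limit
          ulimit -n 65536      # raise it if the process needs more descriptors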
      
      Laura: The server check utility should be faster.
      
      Valentin: It is faster in 2.10.1. Duplicate checks are eliminated.
      
      Dave: It took two months to check unpacked.cern.ch before this fix.
      
      Valentin: It would be good to have a meeting when Dave is at CERN.
      October 9th is not good for a meeting. We could have a meeting the week before or
      after, either October 2nd or 16th.
      
      John: I prefer the 2nd.
      
      Dave: Oct. 2nd would work, and I'll be at CERN.
      
      Valentin: Next meeting is October 2nd, with a meeting room at CERN. Bye.