WLCG AuthZ Call

Europe/Zurich
Description

Previous Actions:


Proposed agenda:

  • Review Previous Actions

 

Zoom meeting:

Link below, in the videoconference section. Please ensure you are signed in to Indico to see the meeting password!

Next Meeting: 

  • 16th December
Videoconference
WLCG AuthZ Call
Zoom Meeting ID: 61554826915
Description: Zoom room for WLCG AuthZ Call
Host: Tom Dack
Alternative host: Hannah Short

Input from Hannah prior to the meeting:

  • Re NGINX rate limiting:
    • Should be easy to put in place, but we need an idea of the reasonable expected traffic from a particular IP
  • Re access token storage:
    • Theoretically no need to store them
    • Understood from CNAF that this is a dependency on the underlying mitreID library, so not trivial to remove

Present: Alexandre, Andrei, Brian, Christophe, Dave, Dimitrios, Doug, Francesco, Jeff, Jim, John, Julie, Liz, Maarten (notes), Martin, Max, Mine, Petr, Roberta, Stefano, Sven

Apologies: Tom, Hannah

Notes: (please send corrections)

Mine reported there was some anxiety and/or confusion about what the end of GSI support would imply for CMS central services and user workflows. Brian pointed out that the end of GSI support only pertains to the HTCondor CE and that the support of GSI in CMS central services relies on CMS's own code. Maarten added that users normally will not interact with CEs directly, but rather submit their jobs to e.g. CRAB, and that there is no problem for users to keep using X509 VOMS proxies for that purpose. Furthermore, there currently is no timeline for any SE implementation to phase out the support of those proxies. Where the Grid Community Toolkit (GCT a.k.a. Globus) is used to implement GSI support, we are dependent on continued support of it, but there are no concerns about that at this time. That said, as no institute in the Grid Community Forum has committed itself to the support, it is best for us to keep reducing our dependencies on the GCT. Brian added that dCache and XRootD have their own implementations of GSI, but that there is a potential concern about GridSite, the library used to make httpd-based services VOMS-aware. Maarten replied that the GridSite devs (CESNET, Prague) would like to set an EOL on that product, rather than also supporting it on EL9. In any case, as we move away from X509 and VOMS, it should steadily become less relevant in the coming years.

Next, Brian reported that Jaime Frey of the HTCondor team has started working on the token plugin callout interface that in particular will allow EGI Check-in tokens to be used for job submissions by VOs that will depend on those. As there has been no news about the EGI Check-in plugin state of affairs, Maarten will rekindle a thread involving the relevant parties. Brian added that the initial requirement for Check-in tokens to include compute scopes is no longer there, as it will be the plugin that decides if a token is acceptable or not. It would still be advisable for plugins not to do callbacks to the user info endpoint, however. Maarten added that we may need to impose some requirements on the plugin behavior, so that it cannot, under certain circumstances, keep HTCondor CE threads occupied for a long time and cause a DoS not unlike the ones we saw during the Halloween incident. Brian added that only the token signature is checked before the plugin callout. Petr then asked what the timeline would be for an official release of the plugin callout machinery. Brian replied that as far as HTCondor is concerned, the release will be well in time to allow experience to be gained in production before the EOL of the last HTCondor series supporting GSI, but that we will need to have the Check-in plugin timeline clarified ASAP. He added that it would be good to have a real VO use case in the loop to help drive this. Andrei replied that DIRAC will typically be in the middle for VOs that will use Check-in tokens and that the first use case would thus involve DIRAC. Maarten added that another use case will be the availability tests submitted through the Ops VO and wondered if the switch to tokens could be an opportunity to let DIRAC also be in the middle for those. He will follow up with EGI. Andrei then asked what the plans are for ARC CEs. Brian replied that during the token hackathon at NIKHEF the plugin machinery was agreed between HTCondor and ARC developers such that the same code can work in both CE types. Maarten added that the bigger challenge for ARC is how to do data management with tokens, but that this does not look urgent. Andrei replied that DIRAC is only concerned with simple job submissions.
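Illustration of the concern about plugins keeping CE threads occupied: the sketch below shows one generic way to bound the time a callout may take. It is not the HTCondor or ARC plugin interface, which these notes do not describe. The function name run_callout, the convention that exit code 0 means "token accepted", and passing the token on standard input are all assumptions made for the example.

```python
import subprocess
import sys


def run_callout(command, token, timeout_seconds):
    """Run a (hypothetical) plugin command with a hard time limit.

    Returns True only if the command exits with status 0 in time.
    A plugin that is slow, missing or failing counts as a rejection,
    so it cannot hold the calling thread longer than timeout_seconds.
    """
    try:
        result = subprocess.run(
            command,
            input=token.encode(),          # assumption: token passed on stdin
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_seconds,       # child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return False
    except OSError:
        return False                       # e.g. plugin executable not found
    return result.returncode == 0


if __name__ == "__main__":
    # Stand-in "plugins" implemented with the Python interpreter itself.
    fast = [sys.executable, "-c", "import sys; sys.stdin.read(); sys.exit(0)"]
    slow = [sys.executable, "-c", "import time; time.sleep(5)"]
    print(run_callout(fast, "dummy-token", 10))
    print(run_callout(slow, "dummy-token", 0.5))
```

Treating a timeout as a rejection keeps the failure mode simple: a hanging plugin can delay a single request by at most the configured limit.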

Next, Petr reminded us of a long e-mail thread about scopes vs. groups in the WLCG profile, which has not been concluded yet. Maarten replied that Paul had been asked to submit a pull request with revised wording that should remove the current ambiguity while not forcing services to implement awkward functionality. He will follow up.

Maarten then asked if the IAM development team had a chance to look into whether we can avoid storing our access tokens in the DB. Francesco replied that it looks possible, but that a policy would be needed on what information needs to be logged. Maarten replied that the logs need to record who requested a given token, the granted scopes, lifetime etc. He added that these changes are not urgent at this time, but that we would like to have them some time next year, to help protect the IAM services against higher loads than usual. The new FTE expected to join the CERN IAM team in February could contribute in that area, as we will also need the right tooling to query and analyze the logs. Francesco added that we should also consider rate limiting. Brian replied that the current maximum rates can be obtained from the logs and can already be used for limits today: new use cases requiring higher rates would first need to be agreed before their limits are adjusted accordingly. He added that it would be good to let the JWKS URL point to a static file, so that it can still be served even when the DB is unresponsive. Francesco will have that looked into, adding that it also ties in with the key rotation, which is currently unimplemented. Brian added that this matter could be handled on the Kubernetes layer. Maarten pointed out that we need to have GitHub issues opened about these various items.
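As a sketch of how the current maximum rates might be obtained from the logs: the code below counts requests per client IP per one-minute bucket and reports the peak for each IP. The log layout (IP address first, then an ISO 8601 timestamp without time zone) and the sample lines are made up for the example; the real NGINX or IAM log format would need its own parsing.

```python
from collections import Counter
from datetime import datetime


def peak_requests_per_minute(log_lines):
    """Return {ip: highest request count seen in any single minute}.

    Assumed line layout: "<ip> <ISO timestamp> <rest of line>".
    Lines that do not match are skipped.
    """
    per_bucket = Counter()
    for line in log_lines:
        parts = line.split()
        if len(parts) < 2:
            continue
        ip, stamp = parts[0], parts[1]
        try:
            minute = datetime.fromisoformat(stamp).replace(second=0, microsecond=0)
        except ValueError:
            continue
        per_bucket[(ip, minute)] += 1

    peaks = {}
    for (ip, _minute), count in per_bucket.items():
        peaks[ip] = max(peaks.get(ip, 0), count)
    return peaks


if __name__ == "__main__":
    # Invented sample data using documentation-only IP ranges.
    sample = [
        "192.0.2.1 2022-12-02T10:00:05 POST /token",
        "192.0.2.1 2022-12-02T10:00:40 POST /token",
        "192.0.2.1 2022-12-02T10:01:10 POST /token",
        "198.51.100.7 2022-12-02T10:00:59 GET /keys",
        "unparseable line",
    ]
    for ip, peak in sorted(peak_requests_per_minute(sample).items()):
        print(ip, peak)
```

The observed per-IP peaks, with some headroom, could then serve as a starting point for the limits, with higher rates agreed case by case as described above.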

Next, Dave reminded us of the various pull requests still open for the WLCG profile document: we should resolve them, tag and release v1.1 of the profile. In particular, #19 is important for CILogon and IAM. Jim has gone ahead and implemented support for removing tokens after a grace period, to help keep the DB size reasonable.

Finally, Petr pointed out that the WLCG IAM instances have default token lifetimes that are IAM defaults, when they should be WLCG profile defaults: refresh tokens have infinite lifetimes, access tokens 1 hour. He proposed that the WLCG profile defaults be applied, to see what implications, if any, they might have on the service stability. That is another action item then.
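To check what lifetime an issued access token actually has, a minimal sketch is to decode the JWT payload and compare the standard exp and iat claims. The signature is deliberately not verified, so this is only suitable for inspection. The function names and the limit passed to exceeds_limit are hypothetical; the value to compare against should be taken from the WLCG profile document.

```python
import base64
import json


def jwt_claims(token):
    """Decode the payload of a JWT WITHOUT verifying its signature."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("not a JWT: expected three dot-separated parts")
    payload = parts[1]
    padded = payload + "=" * (-len(payload) % 4)   # restore stripped padding
    return json.loads(base64.urlsafe_b64decode(padded))


def lifetime_seconds(token):
    """Return exp - iat in seconds, or None if the token never expires."""
    claims = jwt_claims(token)
    if "exp" not in claims:
        return None
    issued = claims.get("iat", claims.get("nbf"))
    if issued is None:
        raise ValueError("token has 'exp' but neither 'iat' nor 'nbf'")
    return claims["exp"] - issued


def exceeds_limit(token, max_seconds):
    """True if the token lives longer than max_seconds (or forever)."""
    life = lifetime_seconds(token)
    return life is None or life > max_seconds


if __name__ == "__main__":
    def encode(obj):
        raw = json.dumps(obj).encode()
        return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()

    # Unsigned dummy token with a one-hour lifetime, for demonstration only.
    demo = ".".join([encode({"alg": "none"}), encode({"iat": 1000, "exp": 4600}), "sig"])
    print(lifetime_seconds(demo))
```

Running such a check against tokens from each IAM instance before and after the change would show whether the new defaults have taken effect.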

 
