WLCG AuthZ Call

Name: WLCG AuthZ Call
Start: 2024-04-25T16:00:00+02:00
End: 2024-04-25T17:00:00+02:00

Thursday 25 Apr 2024, 16:00 → 17:00 Europe/Zurich

Description

Previous Actions:

Proposed agenda:

High-priority IAM issues working document
- Please check it out and comment where needed

Token profile PRs and open issues

Zoom meeting:

Link below, in the videoconference section. Please ensure you are signed in to Indico to see the meeting password!

Next Meeting:

23 May 2024

61554826915

Zoom room for WLCG AuthZ Call

Tom Dack

Maarten Litmaath, Hannah Short

Present: Angela, Berk, Dave D, Dimitrios, Linda, Maarten (notes), Matt, Petr, Stephan

Apologies: Enrico, Federica, Francesco, Roberta, Tom

Notes:

Maarten reports that IAM instances on Kubernetes (K8s) are now available for all VOs supported by CERN and were announced to the experiments on Monday. K8s expert Antonio Nappi has already helped Berk with the deployment and is looking into HA options for when we will fully depend on K8s for our IAM instances. Berk adds that the current K8s clusters essentially are imitations of the OpenShift clusters, which should be fine for the near future, and that Antonio ran into HAProxy issues due to the peculiar use of certificates by IAM. Maarten clarifies the GUI has a certificate trusted by browsers, while the VOMS endpoint needs a certificate from an IGTF CA, and both are sitting behind a single port 443, with smart routing of client requests.

Stephan asks what the deadline is for sites to configure support for the K8s instances. Maarten answers that the deadline in the tickets was April 10, but that many tens of tickets are still unresolved. Petr would like to start using the new instances for the SAM tests in the near future. Maarten asks whether WLCG should set a deadline after which experiments may start doing that. It is decided to set May 31 as the deadline, and Maarten will update all open tickets accordingly. Stephan adds that each experiment decides for itself when to start using the new services. Petr agrees and adds that for ATLAS there will in any case be ad-hoc checks of the sites before critical tests start using the K8s instances. Broken sites will then be dealt with individually.

Stephan asks whether the OpenShift instances will be switched off per experiment. Maarten confirms that each experiment will indicate when it is ready for that step and that the old instances will not be destroyed immediately.

Stephan asks whether the future HA functionality will extend to the client side. Berk answers that the HA will be invisible to the client, thanks to HAProxy or a similar solution, adding that each IAM instance on K8s currently consists of a single cluster, while in the future there can hopefully be multiple clusters, or else a standby cluster that would be able to take over when needed. Stephan then asks whether clients will retry as needed. Several point out that it depends on the client. For example, voms-proxy-init will only do that when presented with multiple VOMS server hostnames, whereas our current intention is to have a single hostname for an intelligent load-balancer. Petr adds we should not look at how things were done in the past, but aim for such a load-balancer to have good knowledge about which back-end hosts are working at any time. Berk adds that IAM already has readiness probes and that K8s currently depends on them. Stephan points out that such probes may have their own issues, as seen when one logs into a bad lxplus node that ought to have been taken out of the alias already. Maarten adds we can keep the OpenShift instances longer while we look into any open questions about retry logic. Stephan posits that retries at a low level may avoid complexity at a higher level in various applications. Maarten observes that low-level retries may still fail, requiring the application to do something after all. Petr thinks the VOMS client retry logic can be taken advantage of for now, but that we should rather make the service sufficiently reliable and stop bothering with client retries. Maarten concurs: we should not over-engineer the solution to deal with problems that should be rare.
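
Illustration (not discussed in the meeting): the retry points above can be sketched in a few lines of Python. This is a minimal sketch, assuming a client that is given a list of hostnames and tries them in order, similar in spirit to what was described for voms-proxy-init with multiple VOMS server hostnames. The names try_endpoints, AllEndpointsFailed and the request callable are hypothetical and do not correspond to any existing VOMS or IAM client API.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


class AllEndpointsFailed(Exception):
    """Raised when every endpoint failed; the caller must still handle this."""


def try_endpoints(
    endpoints: Sequence[str],
    request: Callable[[str], T],
    attempts_per_endpoint: int = 1,
) -> T:
    """Try each endpoint in order and return the first successful result.

    `request` is a hypothetical callable that contacts one host and raises
    OSError (e.g. ConnectionError) when that host is unreachable.
    """
    errors: List[Tuple[str, OSError]] = []
    for host in endpoints:
        for _ in range(attempts_per_endpoint):
            try:
                return request(host)
            except OSError as exc:
                # Remember the failure and move on to the next attempt or host.
                errors.append((host, exc))
    # Low-level retries are exhausted: the application has to react after all.
    raise AllEndpointsFailed(errors)
```

With a single hostname in front of a load-balancer, the endpoint list has one entry and only attempts_per_endpoint matters. The final raise corresponds to Maarten's observation that low-level retries may still fail, so the application needs its own handling in any case.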

Maarten adds that experiments should in any case be able to tolerate a 2-hour downtime of the service, which should be rare but will probably happen, most likely due to something unplanned. Berk notes that upgrades typically take O(10) minutes of downtime. Dimitrios asks whether this would imply giving tokens longer lifetimes to keep things running. Maarten answers that such an approach should indeed be considered. Petr asks whether we will change the lifetimes recommended in the WLCG Profile. Maarten replies we will probably need to change those recommendations, which were written in 2019 before we had any experience with tokens. Dave notes that at FNAL a practical compromise is in production, viz. 3-hour tokens, regularly renewed, and still compatible with the 6-hour recommended maximum lifetime. He adds there is an expectation that the underlying system is HA by design, with multiple instances of critical components, etc.
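
Illustration (not discussed in the meeting): the relation between token lifetime, renewal interval and tolerated downtime can be worked out as follows. This is a rough sketch, assuming tokens are renewed at a fixed interval and that an outage starts at the worst moment, just before a renewal was due. The renewal interval is an assumption for illustration; the minutes only state that the 3-hour tokens at FNAL are "regularly renewed". The function names are hypothetical.

```python
from datetime import timedelta


def max_tolerated_outage(lifetime: timedelta, renew_every: timedelta) -> timedelta:
    """Longest token-service outage that leaves a valid token in hand.

    Worst case: the outage starts just before a renewal was due, so the
    current token was issued `renew_every` ago and has
    `lifetime - renew_every` of validity left.
    """
    if renew_every > lifetime:
        raise ValueError("renewal interval must not exceed the token lifetime")
    return lifetime - renew_every


def max_renew_interval(lifetime: timedelta, outage: timedelta) -> timedelta:
    """Largest renewal interval that still survives an outage of given length."""
    if outage > lifetime:
        raise ValueError("no renewal interval can cover an outage longer than the lifetime")
    return lifetime - outage


# 3-hour tokens and a 2-hour downtime: renew at least once per hour.
print(max_renew_interval(timedelta(hours=3), timedelta(hours=2)))
```

Under these assumptions, 3-hour tokens cover a 2-hour downtime only if they are renewed at least hourly, while tokens at the 6-hour recommended maximum would allow renewal every 4 hours.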

Petr points out the VOMS-Admin transition timeline should be adjusted to make clear the K8s instances may start getting used much later than currently suggested. Maarten concurs and will update the document accordingly, with a new version number and date.

Next, Maarten reports he prepared a first version of a set of slides demonstrating the user registration steps from the user and VO admin perspectives, adding that some awkward aspects were discovered and are being looked into. Some are configuration matters, while a few IAM issues have been opened about others. Berk adds he has prepared a new template for the e-mail sent to the user when the registration has been approved, which invites the user to log into IAM through SSO, sign the AUP and add their certificate to their account. Maarten will try it out and update the training document accordingly.

Maarten reminds the meeting of the list of high-priority issues, which he thinks is too long, and invites the experiments to signal any that are deemed very high-priority, so that we may get the next IAM release to have fixes mainly for those. Petr suggests the most urgent issue may well be the lack of AUP reminders. Maarten replies it surely is an important one, but as a workaround, we still could set all AUP expiration times to be 1 year from now.

The next meeting would have been May 9, but that happens to be Ascension, which is a holiday at CERN and in several countries. Therefore, our next meeting will probably be on May 23 instead.