Previous Actions:
Proposed agenda:
WLCG transition from X.509 to Tokens: Progress and Outlook
Since 2017, the Worldwide LHC Computing Grid (WLCG) has been working towards enabling token-based authentication and authorization throughout its entire middleware stack.
Taking guidance from the WLCG Token Transition Timeline, published in 2022, substantial progress has been achieved not only in making middleware compatible with the use of tokens, but also in understanding the limitations of the WLCG Common JWT Profiles, first published in 2019. Significant scalability experience has been gained from Data Challenge 2024, during which millions of files were transferred with only tokens used as credentials.
Besides describing the state of affairs in the transition to tokens, revisions to the WLCG token profile, and the evolving roadmaps, this contribution also covers the corresponding transition from VOMS-Admin to INDIGO-IAM services, with continuing improvements in terms of functionality as well as deployment.
Next Meeting:
Present: Angela, Berk, Dimitrios, Enrico, Hannah, John, Maarten (notes), Petr, Roberta
Apologies: Linda, Dave K, Tom
Notes:
Enrico reports that the final 1.10.1 release candidate should hopefully be available this week, with some bugfixes, an AngularJS update and no DB schema change beyond the one already included in 1.10.0. The release after 1.10.1 will also contain various interesting improvements.
As examples of ongoing work in development, Enrico names the upgrade to Spring Boot v2.7 in preparation for v3 and support for the expert read role requested by ATLAS. He adds the next release will not happen before the end of September.
Maarten points out we are becoming keen to see a version that no longer stores the access tokens in the DB, as we try to validate a new model for large-scale FTS transfers. Enrico replies that such a change requires various MITREid classes to be moved into the IAM code, to allow the layer between the DB and the code to be redefined. Those changes in turn depend on Spring Boot being already on v3, to avoid a duplication of the work later. The current planning foresees all that to be ready by the end of the year and the work has in fact already started.
Maarten suggests that a quick hack would be acceptable as a short-term solution, while a proper implementation could come later, but the developers would have to be the judge of what is feasible. He explains we would have liked not to have any concerns about the scalability of the DB while we ramp up those data management tests. He adds that the experiments should switch to the K8s instances by early or mid October, after which the OpenShift instances would be given their own DB instances to allow stress tests to be run separately from production.
Dimitrios starts describing the ATLAS tests and asks whether there could be a way to clean up the DB easily when needed. Enrico replies that such unscheduled cleanups are expensive and could keep some tables locked for a while, adding that it would be best to rely on the automatic cleanup. Furthermore, DC24 showed no problems with O(1M) tokens in the DB after the necessary indexes had been defined. He suspects the DB will be able to handle at least up to a few million tokens. Maarten adds the DB currently has 1.3M tokens and everything looks fine.
Dimitrios continues, pointing out the 2 concerns from DC24 that the current tests are addressing: 1) the overloads due to token refresh operations by the FTS, and 2) the inclusion of the modify scope in tokens usable for large parts of the namespace. The current tests have file-specific tokens with lifetimes of 2 weeks and no refreshing (nor token exchange) by FTS. If a token runs out, its transfer just fails and the ball is handed back to Rucio. The rates have typically been 1-2 Hz and occasionally up to 5 Hz, when a big task gets injected. There are about 15 sites involved, all served by the CERN FTS. Enrico asks how many tokens might be active at the same time. Dimitrios answers it depends on the load and the sites, but that he expects the maximum to be less than 10x the current values.
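As a rough illustration of Enrico's question, the number of tokens alive at any moment can be estimated as the issuance rate times the token lifetime. The sketch below is not from the meeting: it assumes the quoted rates are token issuance rates sustained over the whole lifetime, that one token is issued per request, and that expired tokens are removed promptly. The function name is hypothetical.

```python
SECONDS_PER_DAY = 86_400


def steady_state_tokens(rate_hz: float, lifetime_days: float) -> int:
    """Estimate the tokens alive at once for a constant issuance rate.

    Assumption: every token lives for its full lifetime and is removed
    as soon as it expires (rate x lifetime, i.e. Little's law).
    """
    return round(rate_hz * lifetime_days * SECONDS_PER_DAY)


LIFETIME_DAYS = 14  # the 2-week lifetime quoted for the current tests

for rate_hz in (1, 2, 5):  # typical and peak rates quoted above
    tokens = steady_state_tokens(rate_hz, LIFETIME_DAYS)
    print(f"{rate_hz} Hz sustained for {LIFETIME_DAYS} days: ~{tokens:,} tokens")
```

Under these assumptions a sustained 1 Hz corresponds to about 1.2 million tokens and a sustained 5 Hz to about 6 million. Peaks of 5 Hz are described as occasional, so the latter is an upper bound rather than an expectation.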
Dimitrios then brings up for discussion the fact that some sites had to be excluded because their SEs were found not to implement the WLCG profile correctly. Two problems were found: 1) the modify scope does not allow creation of parent directories, and 2) tokens without scopes could be used to delete files! He adds it is difficult to check for such misbehaviors and that sites may have inconsistent configurations. Maarten replies that these matters are orthogonal to the use of tokens, but more likely to occur with the rather new technology of tokens than with the way things were done in the last 20 years, and that we would e.g. need to enhance the SAM tests to catch such issues early. Dimitrios adds there is an opportunity for cross-communication between the experiments, as they can all be affected similarly. Petr adds that Stephan Lammel drew attention to such issues at the end of last year and that CMS also have "negative" SAM tests, where access failures are the expected result because of the lack of a proper credential for the attempted operation. He concludes it is up to us to test the SEs and help avoid misconfigurations at sites.
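The expected behavior behind such "negative" tests can be sketched as a small reference check of path-based storage scopes. This is only an illustration under assumptions, not the SAM test code nor any SE implementation: the function name is hypothetical, scopes are taken to have the form storage.<operation>:<path> as in the WLCG profile, deletion is assumed to require storage.modify, and scopes without a path are conservatively ignored.

```python
from posixpath import normpath


def scope_allows(scope_claim: str, required: str, path: str) -> bool:
    """Return True if the space-separated scope claim grants `required`
    (e.g. "storage.modify") on `path` or on one of its parent directories."""
    target = normpath("/" + path.lstrip("/"))
    for scope in scope_claim.split():
        name, sep, scope_path = scope.partition(":")
        if name != required or not sep:
            continue  # wrong operation, or no path given (assumed invalid)
        base = normpath("/" + scope_path.lstrip("/"))
        if base == "/" or target == base or target.startswith(base + "/"):
            return True
    return False


# Negative checks: the operation is expected to be refused.
print(scope_allows("", "storage.modify", "/atlas/data/file"))
print(scope_allows("storage.read:/", "storage.modify", "/atlas/data/file"))
# Positive check: the modify scope covers the whole subtree.
print(scope_allows("storage.modify:/atlas/data", "storage.modify",
                   "/atlas/data/new/dir/file"))
```

A real test would attempt the operation against the SE with such a token and treat a success in the first two cases as a failure of the test.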
Next, Berk reports he will test the upcoming 1.10.1 release candidate and that the upgrade of an instance should involve less than 10 minutes of downtime. As the first week of September has a CERN holiday on Thursday, it is decided to plan the upgrades for Monday Sep 9, which also plays well with the ATLAS tests, which are switched off during weekends and will then be restarted when all looks good after the upgrades.
Next, Hannah draws attention to the CHEP presentation, a very first sketch of which is linked to the agenda, with a rough outline of the topics we would like to see covered. People can contribute and/or wait for actual drafts to be reviewed over the next month and a half.
Finally, Hannah announces the CERN IAM team has been allowed to open a post for a technical student to work on CERN-specific feature development, hopefully starting as of January.