Present: Angela CB, Tom D (Notes), John SDS, Dave D, Jim B, Stephan L, Maarten L, Matt D, Roberta M, Federica A, David K, Mine AC, JG
Apologies:
Concurrent with Data Challenge Daily Check-up, so will have lost some attendees to that.
- Data Challenge 24
- From the token perspective, a good success
- Not everything went as hoped, but successful transfers were completed. ~hundreds of thousands of files moved using only tokens across Atlas, CMS, and LHCb
- Ran into issues with configuration around involved services: Rucio, Dirac, FTS, and IAM
- High level pre-summary presented at yesterday's MB. A lot of debugging and tuning of most middleware had to be completed over the last two weeks. The pre-tests were useful, but not enough to have the DC run with everyone "reclining in their seats"
- Operations tuning needed
- For IAM: during the first week we discovered interesting behaviours which are not yet fully understood. This should lead to investigation into what needs to be improved, to avoid IAM becoming a bottleneck in the future
- Strong evidence that the usage of tokens in some workflows was sub-optimal; it has been useful to learn and understand this in the context of the current Data Challenge. Knowing where we hit the brick wall, with the evidence to support it, will help prioritise certain improvements over others in the future.
- From here, we need to keep engaging with the experiments and intermediate services like FTS to ensure future usage of tokens is sustainable.
- The move should not be rushed, thanks to the persistence of voms proxies, but it would be good for some services to move to using tokens in production, to understand how things need to scale
- IAM team in agreement with this sentiment
- Looking for a final push from Atlas and CMS to hit their targets for this week.
- At this stage, the Atlas FTS has stopped refreshing tokens to reduce the pressure on its hosts and DB. Atlas will rely fully on voms proxies to avoid any performance limitations, and instead aim to push the network bandwidth to the desired levels.
- CMS have also recently stopped token refreshing, with the expectation that most transfers will complete within the token lifetime (a minimal lifetime-check sketch follows below). They may use voms proxies if this is not the case, to avoid token hurdles here and to test the network side.
- Ultimately the exercise has been successful for learning the state of token integration, with lessons to take away for the next stage.
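As an illustration of the point above, a minimal sketch (not from the meeting; the helper names and safety margin are assumptions) of how a submitter could check whether a transfer is expected to finish within the access token's remaining lifetime, by reading the standard JWT `exp` claim:

```python
import base64
import json
import time


def token_expiry(access_token: str) -> int:
    """Return the standard JWT 'exp' claim (seconds since epoch).

    Note: this only decodes the payload; it does NOT verify the signature.
    """
    payload_b64 = access_token.split(".")[1]
    # Restore base64url padding before decoding.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    payload = json.loads(base64.urlsafe_b64decode(payload_b64))
    return int(payload["exp"])


def fits_in_lifetime(access_token: str, expected_duration_s: float,
                     safety_margin_s: float = 60.0) -> bool:
    """True if the transfer should complete before the token expires."""
    remaining = token_expiry(access_token) - time.time()
    return expected_duration_s + safety_margin_s < remaining
```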
- Token profile
- Maarten is looking to start merging the pull requests that are ready and easy to take in
- Still have some time for people to add further comments or objections - nothing is set in stone at this stage, and further issues may arise
- This should clean up the open issues and pull requests; an email will then be sent to the mailing list for a review
- The aim is then to have the next version released in early spring, once the dust has settled from Data Challenge 24
- The changes recently made provide a notable upgrade to the 2019 version, though there are still trickier and more involved issues to be addressed beyond here
- Dave D asks whether there are any controversial ones to be discussed here
- Maarten highlights a particular one, which proposes discussion on the definition of the "stage" scope
- Needs a quorate discussion, as it will change the meaning of terms within the profile, and will need input from the storage teams
- Another issue would be the case of token lifetimes
- Maarten highlights that the current profile defines lifetimes that are "unimplementable", as they would create significant operational problems - and so this must be adjusted
- The guidelines may need to be adjusted depending on which workflows are being used
- At the moment the document proposes a "one-size-fits-all" description, and this may not be tenable across the different workflows involved
- Changes here are non-trivial, and may take longer discussions to find an agreed-upon answer.
- Look to make a new release with other changes to the profile, before addressing these larger changes
- Dave asks for an example of situations where longer-lived tokens may be suitable, if not for grid jobs
- Maarten suggests central services - areas where most of the load is generated, such as FTS; we could be more flexible in these areas.
- This will depend on experience gained from DC24 and in the coming months
- It is possible that things may vary in their configuration and conclusions drawn are not necessarily the same for each experiment - as an example, the use of Rucio is very different between Atlas and CMS.
- Needs input and review from operations
- Dave notes that in their production, Vault allows long term access due to simple refresh processes
- Maarten agrees, and feels we need a better understanding of where best to use tools like Vault and MyToken to improve operational workflows without reducing security (a minimal refresh-flow sketch follows below)
- Ultimately, whatever the usage of services such as Vault and MyToken, the availability of the token provider is the key breakpoint; even with shorter lifetimes, you may be pushing operational reliability and stability too far
- Tokens will improve workflows and security a lot, but we will need to aim for token lifetimes and refresh patterns that allow operational stability
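To make the refresh discussion above concrete, a minimal sketch of the standard OAuth2 refresh_token grant (RFC 6749) that tools such as Vault and MyToken automate; the endpoint URL and client credentials are placeholders, not taken from the meeting. It also shows the breakpoint mentioned above: if the token endpoint is unavailable, no new access token can be obtained.

```python
import requests

TOKEN_ENDPOINT = "https://iam.example.org/token"   # hypothetical issuer URL


def refresh_access_token(refresh_token: str, client_id: str,
                         client_secret: str) -> str:
    """Exchange a long-lived refresh token for a short-lived access token."""
    resp = requests.post(
        TOKEN_ENDPOINT,
        data={
            "grant_type": "refresh_token",
            "refresh_token": refresh_token,
        },
        auth=(client_id, client_secret),
        timeout=10,
    )
    resp.raise_for_status()  # issuer unavailable -> the workflow stalls here
    return resp.json()["access_token"]
```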
- Versioning for the Token Profile
- Currently the token profile carries a version number, with the guidance that if a service does not recognise the version it must reject the token.
- This means that if we change the version number, we should also review this ruling
- This has resulted in some hard-coded issues - for example, dCache requires a version of 1.0 and rejects the token otherwise
- This means that should we change the profile - which we should, due to the number of changes - we must also insist that services become more intelligent when handling token versions
- This then leads to the question of whether we move to 2.0 as a major version - are we making backward-incompatible changes?
- Dave views this as the right step: saying that we no longer require services to reject unrecognised profile versions is itself a backward-incompatible change
- The new version should therefore state that it is 2.0, that from this point minor revisions and extensions to the profile are allowed, and that services should accept tokens implementing them
- This will require changes in any service fully implementing 1.0, as by that rule it should reject any 2.0 token, but this is likely needed to allow room for future 2.x versions etc
- Enrico notes that StoRM operates similarly to dCache, and would need to have changes made
- Maarten notes that this means any 2.x or similar minor changes shouldn't be a problem from the service side, with services analysing the token, using what they understand and ignoring the rest (see the validation sketch at the end of this topic)
- Dave then notes that therefore any known incompatible changes should be made now, with a planned 2.0 change on the way
- Maarten links this back to the staging issue, where Michael from the CTA team has pointed out that stage and read are separate concepts, whereas at the moment stage is just an extension of read.
- The proposed change would prevent the same token being able to be used to perform both stage and read actions
- Maarten will review and clear the simple backlog, and then email the WG to clarify that this is close to what should be released for 2.0 and request reviews.
- The staging issue should be resolved as part of this, to make sure it is included in a major version change
- Stephan comments that the updated profile should improve the description of the Token Audience
- quoting from an e-mail sent in January: "we should add a section to the WLCG profile definition on audience handling (string match or better) and guidance of selecting audiences (hostnames vs URIs)."
- Next, he points out another issue that needs to be clarified
- Some confusion around what to do if there was no matching scope found, and what this means for the storage sites - at the moment, some implementations fall back to the VO defaults
- Needs clarification that if there is no matching required scope, there should be no access granted
- For example, a valid CMS token with no matching read scope may still be granted read access by default, even though no storage.read scope is present
- There is no open issue for this yet; Maarten will look into it
- This may be somewhat resolved by the decision to consider scopes over groups, meaning that if scopes are present, services should ignore any groups, as proposed and implemented by dCache (also illustrated in the sketch below). This change has already been made to the trunk of the document
- Stephan volunteers to check and verify if this covers what is needed, and make a suggestion if needed
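To gather the versioning, audience, and scope points above in one place, a minimal sketch of the intended service-side behaviour, assuming the WLCG profile claim names (wlcg.ver, aud, scope, wlcg.groups). It is not dCache or StoRM code, the function and parameter names are illustrative, and real scope matching would also have to handle storage paths (e.g. storage.read:/path), which is omitted here.

```python
from __future__ import annotations


def authorize(claims: dict, my_audience: str, required_scope: str,
              fallback_groups: set[str] | None = None) -> bool:
    # 1. Profile version: accept any 2.x minor revision rather than only an
    #    exact "2.0" match, so future 2.x extensions are not rejected.
    major = str(claims.get("wlcg.ver", "")).split(".")[0]
    if major != "2":
        return False

    # 2. Audience: simple string match against this service's identifier,
    #    also allowing the profile's "any" audience value.
    aud = claims.get("aud", [])
    audiences = [aud] if isinstance(aud, str) else list(aud)
    if (my_audience not in audiences
            and "https://wlcg.cern.ch/jwt/v1/any" not in audiences):
        return False

    # 3. Scopes take precedence over groups: if any scopes are present,
    #    groups are ignored, and a missing required scope means no access
    #    is granted (no fall-back to VO defaults).
    scopes = claims.get("scope", "").split()
    if scopes:
        return required_scope in scopes

    # 4. Only when no scopes are present at all may group-based rules apply.
    if fallback_groups:
        return bool(fallback_groups & set(claims.get("wlcg.groups", [])))
    return False
```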
- CERN IT meeting to overview voms admin to IAM transition
- This covered the complications of the transition, and the planned timeline as well as the technical aspects and training needed
- The plan is to publish this as a Google Doc and then discuss it at the March 7 Operations Coordination meeting, after having discussed it first with the "usual suspects"
- Concerns that people are not on the same page - this has been clarified internally, and should now be published for others to follow.
- This step is needed for VOs to become completely dependent on IAM
- VOs are not very critically dependent on IAM at the moment, but it needs to be understood that once the move to IAM for VO management is made, there is no way back.
- This therefore needs careful consideration, to ensure we have as many code and deployment improvements in place as possible. This includes the support situation for the service as well.
- John: had some concerns following Berk's talk (in the ATLAS Jamboree), notably the need to move platforms as this would be disruptive and require sites to take action
- Maarten: some good news - we plan, and think, that we can have OpenShift instances and Kubernetes clusters operating alongside each other.
- Testing has been done by Berk, with two independent container clusters communicating with the same DB.
- This means the shift from OpenShift to K8s is decoupled from the end of life of voms admin - there is some flexibility here.
- K8s move has a number of benefits:
- More control over performance
- Better logging and monitoring
- Better GitOps integration
- Mirrors the CNAF setup
- We can predict the hostnames we will see on K8s, and can therefore run a deployment campaign in ~early March. This could then add extra issuers and voms endpoints as needed; they would be configured but not used until things change over
- John notes that this is promising, and less disruptive than it initially sounded.
- Stephan asks whether this would be possible to do with an IP alias
- Maarten says this was investigated, but it would involve a nasty DNS hack because OpenShift exists in its own sub-domain - it is not just the hostname that changes but the URL, and thus a forwarder would still need to exist within OpenShift
- Stephan notes that whilst it would be transparent, it would be good to remove it from site configurations, so as not to leave a dead authorisation endpoint registered that could be reused by anyone at CERN
- The aim is to have only one switchover rather than two, with a later change simply removing the old configuration
- Some suggestions here - Stephan and Maarten to discuss further offline.