Present: Dimitrios C, Tom D (notes), Matthew D, Dave K, Petr V, Federica A, Roberta M, Berk B, Dave D, Angela C-B, Maarten L, Enrico V, Stephan L
Apologies: John SdS
Previous Actions:
- All to send high priority issues to the mailing list -> Enrico to create board for 1.11 release including those issues (not experiment specific but specifying initial requestor)
- Next call to focus on JWT Common Profile improvements for v 2.0
- Maarten to send email to working groups to ask for consensus on v 2.0 profile (allows developers to progress)
- CHEP paper drafting - see email list.
Proposed agenda:
Minutes:
- Recap of Last meeting:
- Discussed PRs, and one "easy" one was ticked off
- Maarten will follow up by circulating it for review.
- Other issues may be more involved, and may not converge as quickly.
- Proposal is to complete and merge the existing pull request before releasing a 2.0 version. This can then be followed by resolution of the other issues.
- Current focus: Finalising move from OpenShift to Kubernetes
- This is more important now due to CentOS 7 components within OpenShift
- Current self-imposed deadline is the 30th January
- This looks OK for: ALICE, ATLAS, and LHCb
- Hopefully possible for CMS also
- CHEP Paper:
- Currently in progress by Tom on Overleaf:
- Contributions welcome
- SLA Concerns
- Maarten has been communicating that these services should not be assumed to have full 24/7 support - rather, perhaps, 8/7 (eight hours a day, seven days a week).
- The idea is that if IAM is used for time-critical processes, there is some degree of risk. The services are highly available, but further information is still required around token issuance rates.
- OTF meeting in November had a focus on tokens.
- Discussions around support levels - confirming that services will be highly available, but if breakages happen there may be delays - the question is whether the experiments can absorb that.
- Questions around token rates also - Berk has achieved ~900 Hz in tests, but requirements of up to 5-10 kHz have been expressed.
- Generation of their own tokens via scripts is being tested - ALICE has been doing this, and ATLAS and LHCb will follow. The data management services will mint their own tokens, meeting the required rates while also avoiding a time-critical dependency on the issuer.
- CMS will continue with the existing model, as they have lower requirements.
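The self-minted token approach being tested could look roughly like the sketch below. This is an illustrative, stdlib-only sketch using a symmetric HS256 key; a real data management service would use asymmetric keys registered with the infrastructure, and the issuer URL, subject, and scope values here are placeholders, not anything from the meeting.

```python
import base64
import hashlib
import hmac
import json
import time

def b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWTs require."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def mint_token(secret: bytes, subject: str, scope: str, lifetime_s: int = 6 * 3600) -> str:
    """Mint a short-lived HS256 JWT locally, with no round trip to the issuer."""
    header = {"alg": "HS256", "typ": "JWT"}
    now = int(time.time())
    claims = {
        "iss": "https://issuer.example.org",  # placeholder issuer URL
        "sub": subject,
        "scope": scope,
        "iat": now,
        "exp": now + lifetime_s,  # six-hour default lifetime
    }
    signing_input = b64url(json.dumps(header).encode()) + "." + b64url(json.dumps(claims).encode())
    signature = hmac.new(secret, signing_input.encode(), hashlib.sha256).digest()
    return signing_input + "." + b64url(signature)

token = mint_token(b"demo-secret", "dm-service", "storage.read:/")
```

Because signing happens locally, the achievable rate is bounded by CPU rather than by the issuer, which is the point of the approach discussed above.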
- Dave D had comments on the availability aspects:
- In the case of tokens used for pilot job submission, whether there is potential to bring lifetimes down to shorter values - aiming for the 6 hour window.
- If the issuer is highly-available, you can probably tolerate shorter windows, especially if working towards 8x7 operations - with some degree of regular checks
- Maarten replies:
- Does not see an issue with multi-day pilot jobs, as the tokens are not delegated out to the job and are already in locations where other high-risk routes exist.
- However, the issue is that when an issuer goes down, jobs needing fine-grained tokens will start to fail immediately, which can then cascade into an avalanche of failures
- Dave adds:
- Data access is a different use case here, compared to situations where you have refresh tokens and for pilot job submissions
- Dave highlights that it bothers him that long-lived tokens were intended from the beginning as a stop-gap measure while high availability was lacking. The current setup should be designed for high availability, not around its absence. If the infrastructure is designed so that tokens can be refreshed while the job is running, they can still have some hours of lifetime left when the issuer goes down - and we should look to address this.
- Maarten:
- It is an optimisation problem. To some extent, experiments should not care where a token comes from, as long as it is rare that a big downtime leads to a big fallout. Experiments should perform individual risk assessments here.
- Dave:
- Acknowledges that if Rucio & Dirac need lots of fine-grained tokens, this is a different issue and they cannot depend on IAM issuance rates.
- But the rest of the system should be designed for HA, and not around expectations that the issuer will go down.
- Maarten:
- There is a lack of resources for things to be "perfect", and ultimately it will boil down to where we can take the - rare - hit.
- The Kubernetes deployment runs multiple IAM instances on different clusters, giving HA from the IAM perspective. The next step is to understand HA for the database - a more complex undertaking that can introduce new failure modes.
- ALICE, ATLAS, and LHCb look to be ready to migrate to the Kubernetes instances - primarily just hostname changes - and services in WLCG have been configured to accept tokens from these new instances.
- This reduces concerns about the availability of the IAM instances.
- Maarten is happy that different approaches are being attempted in parallel - multiple islands of stability.
- An overall optimisation problem, bigger than just IAM - and may have different solutions for different experiments.
- Still in an R&D phase.
- Dave:
- Would like to see that - aside from specially noted cases where very fine-grained tokens are needed - we aim for six-hour tokens
- Maarten:
- Part of an overall security risk assessment, and different lifetimes can be tolerated depending on the use case in question.
- Other controls can be used - such as limiting a token's scope to a single file. This means that while damage can still be done if the token is stolen, the extent of that damage is much more limited - a calculated risk
- Such well-described use cases and their effects are part of the mission of the Token Trust & Traceability Working Group: to identify and produce recipes that are proven to work and have been reviewed from a security perspective.
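The single-file scoping Maarten mentions can be illustrated with a toy authorization check. The `storage.read:` scope shape follows the WLCG token scheme, but the enforcement logic below is a simplified sketch for illustration, not what any real storage service implements.

```python
def path_allowed(scopes: str, requested_path: str) -> bool:
    """Toy check: does any storage.read scope cover the requested path?"""
    for scope in scopes.split():
        if not scope.startswith("storage.read:"):
            continue
        allowed = scope.split(":", 1)[1]
        # A scope covers its own path and anything beneath it.
        if requested_path == allowed or requested_path.startswith(allowed.rstrip("/") + "/"):
            return True
    return False

# A token scoped to a single file can only read that file:
scopes = "storage.read:/data/run1/file.root"
assert path_allowed(scopes, "/data/run1/file.root")
assert not path_allowed(scopes, "/data/run1/other.root")
```

A stolen token of this kind exposes exactly one file, which is the calculated risk being traded against issuance rate and availability.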
- Dave noted that the WLCG 2.0 profile specifies a maximum lifetime of 6 hours
- Maarten replies that there should be an open issue to update this into a fully correct statement: the 6 hours should be viewed as a default to aim for, but depending on the exact use case and scenario, deviations may be permissible
- Stephan shares some concerns about IAM support levels
- Stephan:
- CMS is unhappy with the support levels for IAM, provided by CERN.
- It is unacceptable that CMS cannot alert someone when IAM is down, given its role as a critical central service for the experiments
- CMS would like to see proper support coverage for IAM. Otherwise, 48-hour token lifetimes are needed to cover weekends without support
- Stephan also highlights that calling the current situation an "R&D" phase is not acceptable, as CMS has been using tokens in production for a while now
- Maarten:
- Highlights that 24/7 support will not be feasible due to costs and effort involved
- The service is therefore instead aiming for high availability, as close to 100% as possible
- Stephan:
- Compares the difference between one ping a year and a flaky service that pings regularly.
- There is a need for an SLA that treats response time as critical - this doesn't need to mean a 15-minute response at midnight, but perhaps issues are looked at until 22:00 and picked up again at 07:00 the following morning. As it currently stands with 8x5 support, a failure could lead to two days of downtime over a weekend or holiday.
- Also highlights that it does not need to be expert-level support at all times - for example, it could be someone who investigates with a simple reboot and then escalates as necessary - perhaps something an operator can do.
- Formalising 8x7 or similar will help a lot, and help the experiments with planning, particularly with lifetimes.
- Maarten:
- Will look to revive the previous thread from Nov/Dec about these concerns and understand what CERN IT can provide. Highlights that the aim is to build trust in IAM. The issue has not gone away, and there are fresh arguments to bring to the discussion.
- Petr had a request for an IAM enhancement:
- Had an incident over Christmas with one user
- A different group of people watches the infrastructure, and those on shift may not have full IAM admin privileges
- Perhaps there should be an operator role within IAM so that users can be disabled immediately if needed - giving more granularity in privileges.
- Has been discussed in the CERN IAM coordination list, and further discussion to be required.
- Potentially a topic for the next IAM hackathon.
- Enrico will follow the thread to understand requirements, to better prep for the IAM hackathon discussions.
- INDIGO IAM Hackathon: Feb 2024