[Enrico Fermi Institute] 12:01:05
So it is 11. We're going to have a short presentation.
[Enrico Fermi Institute] 12:01:11
From Wahid. Are you out there?
[Wahid Bhimji] 12:01:13
Yes. Hello!
[Wahid Bhimji] 12:01:18
Yeah, hold on. I'll just move rooms right now.
[Enrico Fermi Institute] 12:01:18
yeah, okay.
[Wahid Bhimji] 12:01:22
I'm just gonna move into a meeting room
[Enrico Fermi Institute] 12:01:25
So you have your workshop now, where you are discussing this NERSC-11?
[Wahid Bhimji] 12:01:29
Just 10, yes. Yeah, we're not quite that far ahead.
[Enrico Fermi Institute] 12:01:30
Oh, that's 10. I got the number
[Wahid Bhimji] 12:01:36
Yeah, so it's good timing to have this conversation.
[Wahid Bhimji] 12:01:39
Actually, so yeah, I have a few slides.
[Wahid Bhimji] 12:01:46
I don't necessarily need to talk to them.
[Wahid Bhimji] 12:01:50
I wasn't sure if you wanted slides or not.
[Enrico Fermi Institute] 12:01:56
you want to share? Can you allow sharing, or the below
[Enrico Fermi Institute] 12:01:59
Are you allowed to share?
[Wahid Bhimji] 12:02:02
Yeah, I think so. Well, it hasn't... hang on.
[Enrico Fermi Institute] 12:02:03
Oh!
[Wahid Bhimji] 12:02:05
I'm just
[Enrico Fermi Institute] 12:02:06
Great
[Wahid Bhimji] 12:02:09
Let's see, does that work? You see some window? Yeah, let's see if slideshow mode messes it up.
[Enrico Fermi Institute] 12:02:17
Yes, yes.
[Enrico Fermi Institute] 12:02:20
Right.
[Wahid Bhimji] 12:02:23
So, I mean, this is actually just based on some slides I showed at the CCE meeting, or Debbie showed them.
[Wahid Bhimji] 12:02:29
So there's no particular news here, but just to share, like, just to set the context.
[Enrico Fermi Institute] 12:02:34
Thank you.
[Wahid Bhimji] 12:02:36
And then we can just talk about, you know, whatever you want to talk about.
[Wahid Bhimji] 12:02:38
I guess. So this is the current state of the NERSC systems.
[Wahid Bhimji] 12:02:46
So this shouldn't actually say Phase 1. Now we have the full Perlmutter, both the A100-accelerated GPU nodes and the CPU-only nodes.
[Wahid Bhimji] 12:02:59
But this is still not quite yet in production. As was mentioned briefly earlier, we do have some file system problems. In the last stage of kind of upgrading them to use this new Slingshot high-speed interconnect there have been a few snags, I guess, but
[Wahid Bhimji] 12:03:15
those are being resolved, and I'd say it's probably within a month of being at the point of being fully available in a production
[Wahid Bhimji] 12:03:25
kind of mode. And, as you probably know, so far it's been in an early-science, kind of free mode, where you don't have to use your allocation in order to use
[Wahid Bhimji] 12:03:35
it. But that's coming soon. And then we still have Cori in production, and that is the main production machine at the moment, and the goal is to retire that at the start of next year, pending Perlmutter actually being fully in production.
[Wahid Bhimji] 12:03:58
And so, yeah, there's just a comment here that's saying that we do, you know,
[Wahid Bhimji] 12:04:06
look at what user requirements are. While, in order to get increased computing resources, you know, it is necessary to move to accelerated nodes, as the only way we could offer the kind of increasing performance we need from this machine over the previous machine, we do recognize that many communities are not ready for using GPUs for all
[Wahid Bhimji] 12:04:26
of their workload. And so that's why there are
[Wahid Bhimji] 12:04:29
CPU-only nodes that actually provide all of the capability of Cori
[Wahid Bhimji] 12:04:35
in these nodes. Okay, yeah, so that's the system.
[Wahid Bhimji] 12:04:41
This is a bit more of kind of where we're going.
[Wahid Bhimji] 12:04:43
We're only gonna have Perlmutter. So there's a bit more detail on the CPU nodes here. And then, just to say, it was on the previous slide as well,
[Wahid Bhimji] 12:04:51
but there are these file systems that are made available, and also we do put a kind of focus on having connections with external facilities, including other HPC centers as well as, you know, scientific facilities.
[Wahid Bhimji] 12:05:08
Okay, and then, you know, we've shown this many times, that we had this Superfacility project. And this was really about trying to improve the engagement with data-intensive workloads that also need workflow services running alongside. So we have an infrastructure that's
[Wahid Bhimji] 12:05:23
Kubernetes-based for services on the side.
[Wahid Bhimji] 12:05:27
We have, you know, put focus on things like Jupyter notebooks that can also run on the big machines, and we're really pushing for federated identity.
[Wahid Bhimji] 12:05:35
I mean, that's kind of rolled out now, that you can use credentials from other places to access NERSC,
[Wahid Bhimji] 12:05:43
assuming you have a NERSC account now, so you kind of put the two together. And hopefully that will be pushed out, and that's come
[Wahid Bhimji] 12:05:50
up as part of this Integrated Research Infrastructure task force, which is trying to really get kind of cooperation across different centers for these
[Wahid Bhimji] 12:06:04
things. So that's just the example with, you know,
[Enrico Fermi Institute] 12:06:07
Please.
[Wahid Bhimji] 12:06:07
a HEP-type workflow, LZ. They, I mean, you know, we are the primary center for them, and the only center in the US.
[Wahid Bhimji] 12:06:16
So they really have to have all aspects of their workflow working well at NERSC, and that takes a lot of engagement to achieve.
[Wahid Bhimji] 12:06:26
I guess this is saying, okay, so we engage with scientists in lots of ways.
[Wahid Bhimji] 12:06:32
So there's the NESAP program, and, you know, ATLAS and
[Wahid Bhimji] 12:06:35
CMS are both part of that, which can help provide resources to help move to new architectures, and also to explore AI methods, which is really also a way of using GPU resources, as well
[Wahid Bhimji] 12:06:51
as having its own benefits in terms of transformative change to the way science works. And then we also have the Superfacility project that is trying to build more workflow stuff. So for the future NERSC-10 that I was just mentioning, which we have a workshop about now, internally, that
[Wahid Bhimji] 12:07:08
we're trying to... it has achieved CD-0.
[Wahid Bhimji] 12:07:10
So that means there's a mission need for it. And we're really putting together an RFP
[Wahid Bhimji] 12:07:15
now, which will go out to vendors to kind of bid for a machine here, to provide us with the machine.
[Wahid Bhimji] 12:07:21
So that's the stage it's at, and part of the way this has been phrased,
[Wahid Bhimji] 12:07:25
the mission need, is that we need a machine to support workflows rather than just applications.
[Wahid Bhimji] 12:07:32
So I think that helps the experimental HEP community as well. And then I briefly mentioned this thing, the Integrated Research Infrastructure effort.
[Wahid Bhimji] 12:07:41
That is another DOE-wide effort to build workflow technologies and support different centers.
[Wahid Bhimji] 12:07:51
I guess this is just the NERSC-10 mission statement here.
[Wahid Bhimji] 12:07:56
Probably there's nothing new to you there, and this is just saying again that we expect this machine to really stretch out into ESnet and other places, and provide, you know, a way people can run stuff using data from outside.
[Wahid Bhimji] 12:08:13
Then I just briefly wanted to mention these... Yes, sure.
[Enrico Fermi Institute] 12:08:16
Just a quick question on that slide. So that means essentially streaming,
[Enrico Fermi Institute] 12:08:23
Then also streaming in and streaming out
[Wahid Bhimji] 12:08:25
Yes. So that comment was made earlier, and there are various use cases, not just HEP, that want to do that, including the light sources like you mentioned.
[Wahid Bhimji] 12:08:37
So we do anticipate supporting that better. In principle,
[Wahid Bhimji] 12:08:42
it should already be much better on Perlmutter than it was on Cori.
[Wahid Bhimji] 12:08:44
I mean, yeah, don't mention the problems we've had on Cori, which really have never been properly resolved. Perlmutter already
[Enrico Fermi Institute] 12:08:46
Okay.
[Wahid Bhimji] 12:08:53
should have better capabilities to do this.
[Enrico Fermi Institute] 12:08:59
Okay, great. Thanks.
[Wahid Bhimji] 12:09:02
Okay, this is just a couple of slides of context as well about, you know, the landscape as a whole, which is getting increasingly challenging with heterogeneity. In some ways there may be advantages from this. This is the NVIDIA Grace Hopper architecture, which has CPUs and
[Wahid Bhimji] 12:09:20
GPUs with, you know, access to memory between them.
[Enrico Fermi Institute] 12:09:22
Yeah.
[Wahid Bhimji] 12:09:26
So in some sense this could reduce data movement costs, and so in some ways make this easier to program than current architectures.
[Wahid Bhimji] 12:09:36
But, on the other hand, this Grace is an Arm CPU, so there's, you know, already some differences
[Wahid Bhimji] 12:09:42
there. And then there's also this move to chiplets,
[Wahid Bhimji] 12:09:48
AMD, for example, having all kinds of different cores on there.
[Wahid Bhimji] 12:09:50
There's DPUs, so programming in the network.
[Wahid Bhimji] 12:09:54
And then there's all these AI-specific hardware architectures, and then a bit longer term there's the idea of processing in storage. And there's also a move, which we see on the NERSC-10 time frame just kind of coming in, towards disaggregation, which
[Wahid Bhimji] 12:10:09
potentially allows more efficient use of resources. So this is the idea that you could have a disaggregated memory pool which gives you increased memory capacity, but not on the node. So you'd be incorporating memory from outside the node, but that means that people who need
[Wahid Bhimji] 12:10:27
much higher memory capacity would actually be able to access that without us having to buy that in every single node. So there's opportunities here,
[Wahid Bhimji] 12:10:37
but also quite a complex landscape. And then, you know, there's this rise of the cloud market that really is driving everything. So, you know, it's very likely that... so this is an opportunity, of course, because we can capitalize on all this investment going into cloud interfaces and so forth, but it means
[Wahid Bhimji] 12:10:56
that, you know, we also have to recognize that in the kind of machines that we have access to. And we can expect that these interfaces will become the standard way of accessing machines.
[Wahid Bhimji] 12:11:11
So this is also good, I think. It means that if you use these cloud interfaces, then there's, you know, probably a good expectation that these should be... what we,
[Wahid Bhimji] 12:11:25
we should definitely work with the other compute centers to make sure these are well supported at the various compute centers.
[Wahid Bhimji] 12:11:34
And this is just one slide on, I mean, since this was the HPC
[Wahid Bhimji] 12:11:38
and cloud workshop, I just thought I'd add this slide, that, you know,
[Enrico Fermi Institute] 12:11:39
Yes.
[Wahid Bhimji] 12:11:41
we're already kind of using that in there, as mentioned, in the Spin services that sit on the side.
[Wahid Bhimji] 12:11:46
But we're increasingly seeing a tighter integration to the main system.
[Wahid Bhimji] 12:11:51
And so I expect on NERSC-10 there'll be an increasing ability to use cloud-type interfaces to access the big supercomputing resources as well.
[Enrico Fermi Institute] 12:12:04
Okay.
[Wahid Bhimji] 12:12:04
No, it's good. At least. Okay. So I think that's all I really had.
[Wahid Bhimji] 12:12:10
This one is just also about data management. I think we see that also having an increased role here in the NERSC-10 timeframe, which I think should also help this community. But again, and this is probably a general point that I thought of as the discussion was going on earlier, we do have to cater for a very
[Wahid Bhimji] 12:12:29
wide community. So that's one of the maybe disadvantages we have compared to the leadership computing facilities, that we do try to support different user communities.
[Wahid Bhimji] 12:12:39
But we have, you know, thousands of users and several hundred projects that have different needs.
[Wahid Bhimji] 12:12:46
Some of them are traditional HPC-center HPC projects, so they need, you know, tightly coupled, large-scale resources.
[Wahid Bhimji] 12:12:54
Some are more similar to experimental HEP, but have their own,
[Wahid Bhimji] 12:13:00
you know, their own ways of doing things that are, like, a little bit different to how experimental HEP is doing it. And so we have to kind of come to some sort of balance of supporting all of these.
[Wahid Bhimji] 12:13:12
Okay, I think that's me. Yeah, any questions?
[Enrico Fermi Institute] 12:13:18
Thank you. I have one question. I think you mentioned that NERSC-10 is going to have a lot of, you know, accelerators for performance and things like that.
[Enrico Fermi Institute] 12:13:29
Yeah, do you guys have any feeling for what the mix will be
[Enrico Fermi Institute] 12:13:34
of accelerators and CPUs in the next machine?
[Wahid Bhimji] 12:13:39
Oh, well, we don't, and we're having that discussion. So one thing, and there's other things that might come into play here as well... I mean, I think you can guarantee that there will be some GPUs in this machine, pretty
[Enrico Fermi Institute] 12:13:41
Okay.
[Enrico Fermi Institute] 12:13:41
Yeah.
[Wahid Bhimji] 12:13:54
much. Realistically, that will be the most likely, you know, generally usable accelerator
[Wahid Bhimji] 12:13:59
that's around today. Then I mentioned there were these disaggregation technologies, and also several of the vendors are talking about, like, multi-tenancy and so forth.
[Wahid Bhimji] 12:14:10
So it is possible that, you know, one could run the CPU-only workload alongside, without any dedicated CPU-only nodes. So that would be a judgment on whether that technology really allows that, and whether that would provide sufficient resources,
[Wahid Bhimji] 12:14:31
so that those codes that were super GPU-heavy and accelerated would, like, leave enough of the CPU to allow other jobs to run on there that use CPU only. But anyway, it is certain that part of the community, even on the 2026 time scale, won't be ready
[Wahid Bhimji] 12:14:50
for accelerated-only, so, you know, there will continue to be some CPU resource. And then on the more exotic accelerators, I think it is likely that we will have... yeah, well, we will in the RFP have some place that people can pitch AI
[Enrico Fermi Institute] 12:14:52
Okay.
[Wahid Bhimji] 12:15:10
accelerators, for example. You know, whether those are offering a significant benefit above GPUs, I don't think is yet clear
[Wahid Bhimji] 12:15:20
at the minute. I don't think they particularly are now, but they may do
[Wahid Bhimji] 12:15:24
on the 2026 time scale. But the AI workload, you know, is currently not a very big fraction of what we're running, and so it would, you know, have to be sized accordingly. And I would say, on
[Wahid Bhimji] 12:15:38
the integration with cloud, that we're also looking at. As the point was made earlier, there's a huge variety of technology on the cloud, and even though we try to deploy cutting-edge technology, you know, obviously they're quicker to deploy various new technologies. So it
[Wahid Bhimji] 12:15:53
may be that we can, you know, partner with cloud providers to provide some of this capability for experiments and particular workloads that need to run on different accelerators.
[Enrico Fermi Institute] 12:16:11
But would it be fair to say that we shouldn't expect significant scale-up of the CPU? Because if I look at Cori to Perlmutter, the CPU basically stayed pretty much flat, more or less, because the CPU fraction of Perlmutter is somewhat equivalent in performance
[Wahid Bhimji] 12:16:21
Hmm.
[Enrico Fermi Institute] 12:16:30
to what we had on Cori. And just because of power budget reasons, I wouldn't expect that, like, NERSC-10 gives us three times the CPU
[Enrico Fermi Institute] 12:16:40
that's in Perlmutter. I don't know. Probably most of... yeah, okay.
[Wahid Bhimji] 12:16:41
right.
[Wahid Bhimji] 12:16:42
Right. Yeah, I think that's... in terms of CPU-only resources,
[Wahid Bhimji] 12:16:47
I think that would be a reasonable expectation.
[Enrico Fermi Institute] 12:16:50
Good.
[Enrico Fermi Institute] 12:16:58
Other questions for Wahid? A more short-term technical one.
[Enrico Fermi Institute] 12:17:06
So for data transfer in and out, Globus is not the be-all and end-all
[Enrico Fermi Institute] 12:17:10
for LHC. I know that there was some work to do something with XRootD.
[Wahid Bhimji] 12:17:17
Yeah, So that's still ongoing. I mean.
[Enrico Fermi Institute] 12:17:20
How's that going?
[Wahid Bhimji] 12:17:23
Well, I mean, I think, yeah, we're still working on it, right?
[Wahid Bhimji] 12:17:28
I mean, it's got a bit slower now, but I think we are trying to do that, and I think, particularly if both ATLAS and CMS can use the same interface, and also other,
[Wahid Bhimji] 12:17:39
you know, HEP experiments, and even potentially the light sources,
[Wahid Bhimji] 12:17:42
then it's something worth us putting effort into supporting. I also think we need to...
[Wahid Bhimji] 12:17:49
So at the moment Spin, kind of these containerized services, haven't been, like, optimized for use for data management services, but I think that's another thing that we should be able to support in the longer run, which will allow people to run all kinds of different things on that side. I mean, Globus
[Wahid Bhimji] 12:18:08
is for us the best, you know... it's supported by the largest number of other communities, so it's really worth us putting effort into supporting.
[Wahid Bhimji] 12:18:20
But yeah, I do appreciate that not everyone uses it,
[Wahid Bhimji] 12:18:23
and so we do need other things. I did have a brief chat...
[Wahid Bhimji] 12:18:26
I saw Ian Foster, actually, at a conference a couple of weeks ago, and so I did have a brief chat with him,
[Wahid Bhimji] 12:18:34
I think power is always well, but about ways we can maybe improve Globus and XRootD kind of interoperation.
[Wahid Bhimji] 12:18:43
But that was no more than a chat at this point, but he seemed open to more discussions on that front.
[Enrico Fermi Institute] 12:18:52
And it's probably not for this talk; going forward, we can chat later on that. We're, like,
[Enrico Fermi Institute] 12:18:59
technically, we were stuck on Thursday things. But, yeah, we can talk it over.
[Enrico Fermi Institute] 12:19:07
One should be in time. Okay, other questions from anybody else?
[Enrico Fermi Institute] 12:19:16
Anybody on Zoom?
[Enrico Fermi Institute] 12:19:24
By the way, just to say, we had this planned for the afternoon, for the HPC focus area, but due to the ongoing workshop there was a little bit of a scheduling conflict here.
[Enrico Fermi Institute] 12:19:35
So we okay, alright.
[Wahid Bhimji] 12:19:35
Yeah, so I won't be around in the afternoon. So if you wanna attack me, you should do it now. But yeah, we'd be interested in also seeing the blueprint
[Enrico Fermi Institute] 12:19:42
considering.
[Wahid Bhimji] 12:19:45
as well once you have it, or whatever, because I think that will help, you know, as was mentioned.
[Enrico Fermi Institute] 12:19:49
I'm not... I mean, the level... it's probably not gonna be fully public, but there might be a version of it that's going to be public.
[Wahid Bhimji] 12:19:57
Right. Yeah, I mean, again, for influencing kind of architectural decisions,
[Enrico Fermi Institute] 12:19:57
We'll have to see what
[Wahid Bhimji] 12:20:02
I mean, it's really when we're evaluating the RFP
[Wahid Bhimji] 12:20:05
and stuff that we can bring in these considerations.
[Enrico Fermi Institute] 12:20:09
So are you looking at things like the very low-power cores, like Arm?
[Wahid Bhimji] 12:20:14
Yeah, I mean, you know, NVIDIA want to sell you this Grace Hopper architecture now.
[Wahid Bhimji] 12:20:22
So they're selling Arm CPUs with the... so at least with the GPU.
[Wahid Bhimji] 12:20:28
So, at least for the GPU-accelerated nodes,
[Wahid Bhimji] 12:20:31
if they're NVIDIA, then they would be Arm, and they also sell CPU-only, or will do so.
[Enrico Fermi Institute] 12:20:39
That's which
[Wahid Bhimji] 12:20:43