AFTERNOON SESSION
(Eastern Time)
[Enrico Fermi Institute] 14:00:40
we're just getting back into the room here and getting started again.
[Enrico Fermi Institute] 14:00:44
So now we're starting the HPC focus area.
[Enrico Fermi Institute] 14:00:49
Yeah, thanks. So we can jump right into it here.
[Enrico Fermi Institute] 14:00:55
Okay, I see people are rejoining. So this afternoon we have the HPC focus area.
[Enrico Fermi Institute] 14:01:04
We already had quite a bit of discussion, but the hope is that we can go a little deeper on certain topics, and we also have some questions and points for discussion that haven't been brought up yet. So this is just a redo, maybe a little bit deeper than on the
[Enrico Fermi Institute] 14:01:22
introduction slide, on basically what we're targeting and the separation of the user-focused facilities and LCFs. Maybe one thing here on the user-focused facilities that hasn't been discussed a lot is where this is going for the NSF-funded HPC: whether they stay
[Enrico Fermi Institute] 14:01:50
on CPU only, or whether they will also follow the transition to GPU, because so far they pretty much follow their users.
[Enrico Fermi Institute] 14:02:02
They have a few GPUs on the side for training and testing, but usually it's not the bulk of the facility. NERSC has made that switch with the transition from Cori to Perlmutter.
[Enrico Fermi Institute] 14:02:15
So do we have to worry about the same switch happening in the NSF facilities at some point? They have the same power constraints; probably not as much, because they're smaller facilities, but they're also getting larger.
[Enrico Fermi Institute] 14:02:37
Right, do you have any input on that question? Which question? What about the next generation of NSF-funded HPC: do we have to worry about them making the transition
[Enrico Fermi Institute] 14:02:46
to GPU, or will they stay on CPU and follow their users? There's always gonna be a big honking CPU machine.
[Enrico Fermi Institute] 14:02:56
So I don't think Anvil, Expanse, or any of those sorts of outfits...
[Enrico Fermi Institute] 14:03:06
Okay, but past that, you know, it comes down to the question: do you believe what NSF
[Enrico Fermi Institute] 14:03:14
has been authorized to spend by Congress, or do you believe what they've been appropriated by Congress?
[Enrico Fermi Institute] 14:03:19
So some of the big expansions, you know, would allow a leadership class facility on the NSF side, and that would be for a lot of the same reasons as on the DOE
[Enrico Fermi Institute] 14:03:39
side, at the other end. So if you believe that's done and it's gonna happen, then yeah, there's gonna be a big honking, GPU-heavy machine.
[Enrico Fermi Institute] 14:03:49
But I don't think that that's going to be
[Enrico Fermi Institute] 14:03:54
in addition to the other types of resources they have. I mean, the big machine that they have right now is Frontera.
[Enrico Fermi Institute] 14:04:02
That's all CPU. It's very, very...
[Steven Timm] 14:04:04
Right, if you look at their website. Yeah, if you look at the TACC website, there is also something about their leadership class facility machine coming. I don't
[Enrico Fermi Institute] 14:04:06
It's not a leadership class
[Steven Timm] 14:04:19
think they say when it's coming, but they say it's coming.
[Enrico Fermi Institute] 14:04:22
Yeah, so they've gotten authorization to do design studies.
[Enrico Fermi Institute] 14:04:26
And, you know, they're doing all the kind of information gathering to do such a thing. But at some point somebody has to come up with a slug of money, and I think under what Congress has authorized, the NSF has a sufficient slug of money, because their total budget goes up
[Enrico Fermi Institute] 14:04:45
by 20. But Congress, at least in 2022, has not actually given them the money.
[Enrico Fermi Institute] 14:04:54
So that's where it gets into crystal-ball territory, and you can lose your whole afternoon trying to guess what funding agencies are going to do.
[Enrico Fermi Institute] 14:05:01
So I wouldn't suggest doing that. But again, the short version is, I personally believe that there's always going to be some sort of heavy CPU resources, because they are wildly popular within NSF. There are going to be GPU resources.
[Enrico Fermi Institute] 14:05:18
So all the GPUs that are... oh, I guess you have
[Enrico Fermi Institute] 14:05:21
Bridges-2, but it's gonna be very balanced, based on the user
[Enrico Fermi Institute] 14:05:26
community. The thing that might change, be different, or grow is whether or not you believe in this TACC leadership facility.
[Steven Timm] 14:05:34
good.
[Enrico Fermi Institute] 14:05:35
Okay, So we will The one. The question, though.
[Ian Fisk] 14:05:36
oh!
[Ian Fisk] 14:05:41
I wanted to mention a couple of things. Expanse is not that big. Expanse is 90,000 cores, which makes it like a tenth of the WLCG. It's far from a leadership class machine.
[Enrico Fermi Institute] 14:05:50
Yeah.
[Steven Timm] 14:05:51
Indeed
[Ian Fisk] 14:06:04
And I think, if you look at where NSF
[Ian Fisk] 14:06:08
has spent their money, they've also spent their money on really exploratory things, like Voyager, which is an AI...
[Steven Timm] 14:06:13
Yeah.
[Enrico Fermi Institute] 14:06:14
They have an ARM testbed at Stony Brook right now.
[Ian Fisk] 14:06:15
Yeah, they have the... some Japanese name.
[Enrico Fermi Institute] 14:06:20
Ookami, I think.
[Ian Fisk] 14:06:21
yeah, And so they've also spent some money in exploratory things.
[Ian Fisk] 14:06:27
And my guess is that Brian's right in the sense that NSF is a little bit more in tune with what people are using. But you could imagine that that could change, and as people figure out how to use alternative machines, like the GPUs, which in addition to having a lot more processing
[Steven Timm] 14:06:29
Yeah.
[Ian Fisk] 14:06:45
power have a lot more processing power per watt, and that becomes important to people, then there'll be pressures there, too.
[Enrico Fermi Institute] 14:06:48
Yeah.
[Enrico Fermi Institute] 14:06:54
Yeah, I guess the point I was making is NSF
[Enrico Fermi Institute] 14:06:59
is very attuned to the user base. If 5 years from now the user base is screaming for GPUs because machine learning has eaten the world,
[Ian Fisk] 14:07:09
right.
[Enrico Fermi Institute] 14:07:10
then you're gonna see a much stronger shift. And even if that doesn't happen, I don't get the impression that there's a lot of growth opportunity, even at NSF-funded CPU HPC. Yeah, it's a little bit of organic growth.
[Enrico Fermi Institute] 14:07:27
I mean, Bridges-2 is faster than Bridges, and Expanse is a bit faster than Comet.
[Steven Timm] 14:07:27
Great
[Enrico Fermi Institute] 14:07:32
But it's not an order of magnitude.
[Enrico Fermi Institute] 14:07:33
They don't double or triple the capacity from one step to the next.
[Steven Timm] 14:07:36
Great
[Steven Timm] 14:07:40
This is a question. I'm not sure if you're gonna come to it later in the thing.
[Steven Timm] 14:07:44
Tell me if this is too early to ask. But you foresee even more CPU that you need, and existing leadership class facilities are not going to grow that much.
[Steven Timm] 14:08:00
During that time, your allocation on them is not going to grow that much either, and then you have our national labs.
[Steven Timm] 14:08:07
They're not buying more because, strategically, they're saying we're going to the leadership class facilities.
[Steven Timm] 14:08:16
But there's going to be a gap: between 50 and 70% of the resources you need are not going to be there.
[Steven Timm] 14:08:26
The projections are very... HPC is not gonna solve the whole problem.
[Steven Timm] 14:08:30
They're not enough of them good if you guys at all.
[Enrico Fermi Institute] 14:08:34
Hmm. I mean, if you can use the GPU... and that gets to the second point. We have the LCFs,
[Steven Timm] 14:08:43
Yeah, yeah.
[Enrico Fermi Institute] 14:08:44
where I'm going a little bit into the LCF
[Enrico Fermi Institute] 14:08:46
landscape, and we discussed a lot of that already in the morning session.
[Enrico Fermi Institute] 14:08:50
But one thing is the trend towards accelerators.
[Enrico Fermi Institute] 14:08:56
If you look at what's there in terms of CPU, that's usually significant.
[Enrico Fermi Institute] 14:09:01
Most of the capacity is on the GPU side, which we can't really use effectively right now, but there's a lot of CPU there. And what in my mind is an open question is: what's the threshold for being able to use these machines? What's
[Enrico Fermi Institute] 14:09:19
good enough in terms of GPU utilization?
[Enrico Fermi Institute] 14:09:24
I don't know the answer to that. I know that very early on, when that move started to happen, there were statements that I heard from people who were in meetings with the agency, saying: oh, you have to have full-on GPU utilization or you're not going to
[Enrico Fermi Institute] 14:09:42
get allowed on the machine, and that's softened significantly over time.
[Enrico Fermi Institute] 14:09:46
But still, I mean,
[Enrico Fermi Institute] 14:09:50
there are two sides. One is: what do we need to do to get a proposal through?
[Taylor Childers] 14:09:56
sure, sure.
[Enrico Fermi Institute] 14:09:57
The other is: how much do we need to use the GPU
[Enrico Fermi Institute] 14:10:00
so we don't feel ashamed of running on these resources ourselves?
[Enrico Fermi Institute] 14:10:05
There's a certain point where it's just ridiculous, even if they would allow us to run, right?
[Enrico Fermi Institute] 14:10:10
So we have a question coming from Paolo.
[Paolo Calafiura (he)] 14:10:12
It's a comment, really.
[Paolo Calafiura (he)] 14:10:17
I keep hearing the problem framed in this way, not only here but, you know, in ATLAS, a lot, even more than here.
[Paolo Calafiura (he)] 14:10:26
Something like: oh darn, the HPC community is making this move to GPU,
[Paolo Calafiura (he)] 14:10:32
they are losing all of their users. I don't have precise data, but my understanding, anecdotally, is that today, if you want to run on a GPU node on Perlmutter, you have to wait hours. So we are the laggards, okay? The new
[Enrico Fermi Institute] 14:10:47
Yes.
[Enrico Fermi Institute] 14:10:52
Yeah.
[Paolo Calafiura (he)] 14:10:54
communities have no problem whatsoever in using accelerators.
[Paolo Calafiura (he)] 14:10:59
So we have a choice. Either we become like banks: we keep running our IBM
[Paolo Calafiura (he)] 14:11:05
370 and COBOL, and we are fine, you know, we have the money to do it, and we accept the physics limitations that come with it.
[Paolo Calafiura (he)] 14:11:16
Or we jump. I think this, you know, framing the problem like: yeah, maybe NERSC is gonna give...
[Paolo Calafiura (he)] 14:11:23
I mean, NERSC is gonna give us what we have now, presumably for the lifetime of Perlmutter.
[Paolo Calafiura (he)] 14:11:29
That's about 1% of ATLAS simulation.
[Paolo Calafiura (he)] 14:11:33
I know the ATLAS numbers. I don't know the others.
[Paolo Calafiura (he)] 14:11:36
I mean, it's nice to have it.
[Paolo Calafiura (he)] 14:11:40
But is it worth having a workshop about 1%, you know, or 2?
[Paolo Calafiura (he)] 14:11:45
I think either we make the jump, or
[Paolo Calafiura (he)] 14:11:53
we just step out and we say: look, we will use our legacy CPUs, and then perhaps for Run 5, when I'm retired, or worse, we will use whatever architecture is in use by then. So I think we're framing
[Enrico Fermi Institute] 14:12:06
But
[Paolo Calafiura (he)] 14:12:11
the problem in a slightly wrong way, and I know that there are other slides discussing accelerators and whatnot.
[Paolo Calafiura (he)] 14:12:21
But yeah.
[Enrico Fermi Institute] 14:12:23
But, Paolo, the jump is not going to be a jump to the top in one
[Enrico Fermi Institute] 14:12:27
go. We're going to jump up one step, and then
[Enrico Fermi Institute] 14:12:30
we can jump up the next step, and so on. And how to get to that first step,
[Enrico Fermi Institute] 14:12:36
That's basically my question, Because
[Ian Fisk] 14:12:37
Right. But I think what Dirk would probably say, which I agree with, is that at some point we have to commit that this is a step we're going to make, that we're going to succeed at this. And we can define what success
[Ian Fisk] 14:12:51
looks like, but we sort of have to say we're going to do this. And I think you have to say that because, to first order, all of the processing is in these machines. The other thing is, I think we're actually not as far away as we think. Like,
[Enrico Fermi Institute] 14:12:54
Yeah, I mean.
[Ian Fisk] 14:13:06
not ATLAS, but CMS, ALICE,
[Ian Fisk] 14:13:10
and LHCb are all using GPUs in the online right now, running software
[Ian Fisk] 14:13:13
they wrote. We're not that far away, and I think you can define whatever sort of metric you want.
[Enrico Fermi Institute] 14:13:14
Okay.
[Ian Fisk] 14:13:20
But my guess is that a few algorithms that show that the thing is faster with the GPUs than without are enough to sort of get you in the door.
[Enrico Fermi Institute] 14:13:28
Yeah, that was my question.
[Enrico Fermi Institute] 14:13:30
And I agree with the answer. I just wanted to phrase it as a question, because I know there are disagreements about that. And there were also statements, years ago, from the people that fund these machines that were different from that.
[Ian Fisk] 14:13:40
Right, and I think one of the things that we have to be a little bit careful of is that you can be a victim of your own success here. Like, if you take advantage of the accelerated resource
[Ian Fisk] 14:13:51
and the processing speed for reconstruction of the tracker in CMS goes up by a factor of 10,
[Ian Fisk] 14:13:56
we do not have an I/O system that's designed to handle 10 times the data going in.
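The scaling behind this point can be written down in a few lines. This is only a sketch: the event rate and event size below are made-up illustrative numbers, not CMS figures, and `input_rate_mb_per_s` is a hypothetical helper.

```python
def input_rate_mb_per_s(events_per_second, event_size_mb):
    """Input bandwidth a job needs to keep its processing busy."""
    return events_per_second * event_size_mb

# Hypothetical numbers, chosen only to illustrate the scaling.
EVENT_SIZE_MB = 1.5
baseline = input_rate_mb_per_s(events_per_second=2.0, event_size_mb=EVENT_SIZE_MB)
# Same events, but processed 10 times faster on an accelerated node.
accelerated = input_rate_mb_per_s(events_per_second=20.0, event_size_mb=EVENT_SIZE_MB)

print(f"baseline: {baseline} MB/s, accelerated: {accelerated} MB/s")
```

Since the event size does not change, a tenfold gain in processing speed translates directly into a tenfold increase in the input rate the I/O system has to sustain.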
[Enrico Fermi Institute] 14:14:05
There's a comment from Eric
[Eric Lancon] 14:14:09
Yes, I wanted to go back to what Paolo and Ian mentioned.
[Eric Lancon] 14:14:17
I believe there are two topics which are mixed here:
[Eric Lancon] 14:14:21
accelerators and HPCs.
[Eric Lancon] 14:14:27
So, as mentioned by Ian, the code will be ready for almost all of the experiments, by necessity, for using accelerators.
[Eric Lancon] 14:14:40
So nothing prevents classical sites from offering accelerators as resources for the experiments.
[Eric Lancon] 14:14:51
Now, the use of the big HPCs is supposed to
[Eric Lancon] 14:15:01
address the lack of CPUs moving forward for HL-LHC.
[Enrico Fermi Institute] 14:15:12
Okay.
[Eric Lancon] 14:15:16
Is the missing factor as big as we believe? That's what we have to understand.
[Eric Lancon] 14:15:23
Because do we need to use HPC or not, to complement the classical resources beyond the standard extrapolation? That's the real question. It's not so clear
[Eric Lancon] 14:15:34
that we really need the big HPCs
[Eric Lancon] 14:15:43
for complementing the effort at the HL-LHC.
[Eric Lancon] 14:15:44
Is it true or not? Maybe it's only a factor of 50% above the needs.
[Enrico Fermi Institute] 14:15:56
Okay.
[Paolo Calafiura (he)] 14:16:00
I can comment on the needs, having been involved in calculating them. One of the things we have to keep in mind is that the needs sort of naturally tune to the resources available.
[Paolo Calafiura (he)] 14:16:20
So there is no point in claiming that your needs are 100 times bigger than the resources you have available.
[Paolo Calafiura (he)] 14:16:26
So you make choices which makes those needs go down.
[Paolo Calafiura (he)] 14:16:32
And what I'm very nervous about is that as we try to achieve a reasonable computing model, we are potentially giving up things that we could do, especially in a world of precision physics, which is the
[Paolo Calafiura (he)] 14:16:55
one we are moving towards with Run 3 and Run 4.
[Paolo Calafiura (he)] 14:16:58
I don't know about Run 5. So I'm a little bit nervous that, yeah, we don't really need it,
[Paolo Calafiura (he)] 14:17:08
but only because we're making physics choices which are allowing us not to need it. And whether those choices are wise or not, I'm probably not competent to judge, but that said...
[Enrico Fermi Institute] 14:17:28
Ian was... yeah.
[Ian Fisk] 14:17:29
Yeah, it was just a comment about the scale, which is to say that when we started planning for the HL-LHC we had needs that were factors of 6 or 10 more than we could expect.
[Ian Fisk] 14:17:45
And we saw that it was really terrible, and then we made some improvements.
[Ian Fisk] 14:17:49
So we fixed it, and now it's down. But there's a difference between failing completely and making some really painful choices. I think we're now at the level where, if the HPCs got us 25% and that allowed us to make a lot fewer really painful
[Enrico Fermi Institute] 14:17:57
You.
[Ian Fisk] 14:18:04
choices... I understand 25% is not a factor of 4 or 5, like
[Enrico Fermi Institute] 14:18:06
Okay.
[Ian Fisk] 14:18:10
it was a few years back, but it seems like there was a time,
[Ian Fisk] 14:18:14
certainly, when if someone told you that you had 20% more computing resources, you would have been thrilled.
[Ian Fisk] 14:18:24
And it just seems like these resources are on the table.
[Ian Fisk] 14:18:28
They are built. They're there. It seems like
[Ian Fisk] 14:18:34
it'd be a really strange choice not to at least try to use them.
[Eric Lancon] 14:18:40
no, no, I agree. But the first thing is to get the software
[Enrico Fermi Institute] 14:18:50
Yeah, maybe that's a good way to lead over to the next slide, which is looking at how we're actually using these facilities, like some of the integrations. Next slide.
[Enrico Fermi Institute] 14:19:02
So where are we actually running today, actively? ATLAS, you want to say something? So in ATLAS we've been, you know, using Cori and Perlmutter for multiple years.
[Enrico Fermi Institute] 14:19:15
We've had a proposal for using TACC Frontera.
[Enrico Fermi Institute] 14:19:21
And in the past we used OLCF and ALCF.
[Enrico Fermi Institute] 14:19:25
Yeah, but those are sort of dormant now. Most of the focus is on NERSC, Cori and Perlmutter,
[Enrico Fermi Institute] 14:19:32
and TACC. Yeah. CMS: similarly, we focused on the user facilities because, low-hanging fruit, it was easier.
[Enrico Fermi Institute] 14:19:42
Cori and Perlmutter for multiple years. We have XSEDE, which now, I guess, is ACCESS; that switch hasn't happened yet for us,
[Enrico Fermi Institute] 14:19:50
so for the next one we'll have to deal with ACCESS. We had been running on whatever was available.
[Enrico Fermi Institute] 14:19:58
Currently that set is Bridges-2, Expanse, Anvil and Stampede2. In the past
[Enrico Fermi Institute] 14:20:04
it was Bridges and Comet. On Frontera we've been running for multiple years. And then we had one LCF allocation in the past, and one is currently active. In the past
[Enrico Fermi Institute] 14:20:16
we had the Theta allocation that was joint with ATLAS.
[Enrico Fermi Institute] 14:20:20
We used it to do some generator work. And now we are actually trying something a little bit more serious, which is on Summit: to contribute Summit resources
[Enrico Fermi Institute] 14:20:35
to the end-of-year 2022 CMS
[Enrico Fermi Institute] 14:20:40
data re-reconstruction. The physics validation of POWER was just completed, not with Summit but with Marconi100, which is basically exactly the same system
[Enrico Fermi Institute] 14:20:51
architecture as Summit, but that was CPU-only validation.
[Enrico Fermi Institute] 14:20:56
So hopefully GPU will be the next step. Basically, that's what we want to do with Summit.
[Enrico Fermi Institute] 14:21:03
We also have some slides on the... you know, there are European efforts as well.
[Enrico Fermi Institute] 14:21:09
Just wanted to show it as an example, because they sometimes follow different approaches in terms of integration.
[Enrico Fermi Institute] 14:21:16
So you're using GPUs in the end-of-2022 data re-reco? That's the plan; we want to use them.
[Enrico Fermi Institute] 14:21:22
We have 50,000 hours on Perlmutter that we got as an allocation, and we have 50,000 hours on Summit, which is not much.
[Enrico Fermi Institute] 14:21:31
It's not going to contribute a lot, but we just want to show a proof of principle.
[Enrico Fermi Institute] 14:21:36
And then, if it works, we would ask for more hours in the next allocation cycle to do this again,
[Andrew Melo] 14:21:41
Sorry, what was the second half of Rob's question? I heard "you want to use GPUs", and then I kind of... yeah.
[Enrico Fermi Institute] 14:21:41
But with the larger
[Enrico Fermi Institute] 14:21:51
So I was asking about the plans for the end-of-2022 data re-reco, and if you're going to use GPUs.
[Enrico Fermi Institute] 14:22:03
Yes. I mean, the problem at the moment is more putting together a workflow and trying to figure out which GPU algorithms are ready to put in. And it might just be that we're going to run something in parallel to the normal reconstruction, and then use that as a
[Enrico Fermi Institute] 14:22:23
validation. Maybe run some validation samples; I would be happy with that as well.
[Enrico Fermi Institute] 14:22:27
It's not directly the re-reconstruction, but more like a workflow that they can compare.
[Andrew Melo] 14:22:35
So about that: we actually do have an offline re-reconstruction workflow that's very close to being validated.
[Enrico Fermi Institute] 14:22:39
Okay.
[Enrico Fermi Institute] 14:22:44
And I know I know, I know.
[Andrew Melo] 14:22:45
Yeah, it's just a matter of... there are some issues with the CPU
[Andrew Melo] 14:22:52
side of the memory, you know, taking more than it needs. But I think by the end of the year, for sure, we're going to at least be doing some fraction of the reconstruction with GPUs.
[Enrico Fermi Institute] 14:23:01
Yeah, I hope I hope that that will happen, and then we can
[Enrico Fermi Institute] 14:23:07
Great. As far as integration goes, specific technologies: for ATLAS we're using Harvester, which runs at the edge.
[Enrico Fermi Institute] 14:23:19
So at all of our HPC facilities we run a Harvester process that essentially exists on the HPC
[Enrico Fermi Institute] 14:23:24
login nodes. Harvester directly pulls jobs down from PanDA, transforms them, and packs them appropriately, so that they can, you know, be sent to the local HPC.
[Enrico Fermi Institute] 14:23:36
It also handles the data transfer. So it facilitates staging
[Enrico Fermi Institute] 14:23:40
that data in and out of the Rucio data federation, essentially by way of a third-party service that lives at BNL.
[Enrico Fermi Institute] 14:23:50
And so this approach works on all the sites, including LCFs, because pilots don't necessarily have to talk to the wide area network. Everything is local, and Harvester facilitates all the communication with PanDA through the shared file system.
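The edge-service pattern described here (pull work on a login node, pack it for the batch system, and talk to workers only through a shared file system) can be sketched roughly as follows. This is not Harvester's actual code or API: every function name, file name and job field below is hypothetical, and a temporary directory stands in for the shared file system.

```python
import json
import tempfile
from pathlib import Path

def pack_jobs(jobs, jobs_per_bundle):
    """Group individual jobs into bundles sized for one batch submission."""
    return [jobs[i:i + jobs_per_bundle]
            for i in range(0, len(jobs), jobs_per_bundle)]

def write_bundle(shared_dir, bundle_id, bundle):
    """Edge side: drop a bundle where the batch nodes can read it."""
    path = Path(shared_dir) / f"bundle_{bundle_id}.json"
    path.write_text(json.dumps(bundle))
    return path

def worker_run(bundle_path):
    """Worker side: read work and write status, with no network access."""
    bundle = json.loads(Path(bundle_path).read_text())
    status = {job["id"]: "finished" for job in bundle}
    status_path = Path(bundle_path).with_suffix(".status.json")
    status_path.write_text(json.dumps(status))
    return status_path

def collect_status(status_path):
    """Edge side: read results back so they can be reported upstream."""
    return json.loads(Path(status_path).read_text())

# Hypothetical jobs standing in for what would be pulled from the workload manager.
jobs = [{"id": f"job{i}", "events": 1000} for i in range(5)]
bundles = pack_jobs(jobs, jobs_per_bundle=2)

with tempfile.TemporaryDirectory() as shared:
    reports = {}
    for n, bundle in enumerate(bundles):
        bundle_path = write_bundle(shared, n, bundle)
        reports.update(collect_status(worker_run(bundle_path)))

print(reports)
```

The point of the design is that only the edge process needs outside connectivity; the worker function touches nothing but files, which is why the same approach can work at sites without outbound networking from the batch nodes.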
[Enrico Fermi Institute] 14:24:12
In CMS we do things a little bit differently. This has advantages and disadvantages.
[Enrico Fermi Institute] 14:24:18
The advantages are mostly in the HPC integration of the user facilities, because it really makes it look like a grid site.
[Enrico Fermi Institute] 14:24:30
It's basically the same approach we use for opportunistic use, like when we tried to run on the LIGO site.
[Enrico Fermi Institute] 14:24:36
Basically, the software is available via CVMFS, or cvmfsexec
[Enrico Fermi Institute] 14:24:42
that we run ourselves. We use container solutions for OS
[Enrico Fermi Institute] 14:24:46
independence, a local squid, and no managed storage at these facilities. We treat it as an extension; it's basically an add-on to Fermilab storage, so it uses
[Enrico Fermi Institute] 14:24:58
Fermilab storage, or via AAA the whole CMS storage, but mostly Fermilab, for reading input data, streaming input data.
[Enrico Fermi Institute] 14:25:06
And then it stages out directly to Fermilab, so we don't have to worry about the local site storage or data transfers
[Enrico Fermi Institute] 14:25:11
being managed. Everything is contained within the job, and the provisioning integration follows the OSG model.
[Enrico Fermi Institute] 14:25:21
So we submit pilots through HTCondor BOSCO,
[Enrico Fermi Institute] 14:25:23
remote SSH. That's either, in the case of NERSC, directly connected to HEPCloud, or, for XSEDE and TACC resources,
[Enrico Fermi Institute] 14:25:31
we go through OSG-managed HTCondor-CEs, and we might eventually also do the same for those. Do you stage in, or is it streaming? It's streaming.
[Enrico Fermi Institute] 14:25:40
And have you measured staging versus streaming to see the difference? We know it for NERSC, because at NERSC the storage is now fully integrated, but at the beginning
[Enrico Fermi Institute] 14:25:56
it wasn't fully integrated, and we just copied in, more or less manually, the most often used pileup library.
[Enrico Fermi Institute] 14:26:04
They give us some space for that, and I actually have a comparison.
[Enrico Fermi Institute] 14:26:07
It makes very little difference for job failure rates. CPU efficiency is about 5 to 10% different.
[Enrico Fermi Institute] 14:26:14
Okay, so it's a small... it's an efficiency optimization.
[Enrico Fermi Institute] 14:26:19
It's a noticeable effect, but it's not a huge effect exactly.
[Enrico Fermi Institute] 14:26:22
You don't see a 50% drop, for example.
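For reference, a comparison like this is usually made on CPU efficiency, meaning CPU time divided by wall time times cores. A minimal sketch follows; the job timings are invented for illustration and are not the measured NERSC numbers, and whether the quoted 5 to 10% is meant in absolute points or relative terms is not stated in the discussion.

```python
def cpu_efficiency(cpu_seconds, wall_seconds, cores):
    """Fraction of the allocated core time that was spent on CPU work."""
    return cpu_seconds / (wall_seconds * cores)

# Hypothetical 8-core, one-hour jobs: one reading locally staged input,
# one streaming the same input over the network.
staged = cpu_efficiency(cpu_seconds=25200.0, wall_seconds=3600.0, cores=8)
streamed = cpu_efficiency(cpu_seconds=23040.0, wall_seconds=3600.0, cores=8)

difference_points = (staged - streamed) * 100
print(f"staged {staged:.3f}, streamed {streamed:.3f}, "
      f"difference {difference_points:.1f} points")
```

With these made-up inputs the gap is a few percentage points, which is the regime described here: noticeable, but nowhere near a 50% loss.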
[Enrico Fermi Institute] 14:26:28
And the upside of this is that it's simple.
[Enrico Fermi Institute] 14:26:34
We don't have anything running permanently at the HPC site.
[Enrico Fermi Institute] 14:26:38
It basically completely follows the grid integration model.
[Enrico Fermi Institute] 14:26:43
The downside is that the LCFs are not really compatible with this approach: because you don't have outbound Internet, you can't follow this approach completely. The runtime kind of works the same way, because cvmfsexec and Singularity
[Enrico Fermi Institute] 14:26:58
are both there, so that part works, and as long as you can somehow run a squid server on the edge,
[Enrico Fermi Institute] 14:27:03
you can do things. The integration at the provisioning layer is the larger issue.
[Enrico Fermi Institute] 14:27:11
Yeah, and we only have prototypes so far, nothing
[Enrico Fermi Institute] 14:27:13
we would call production. And AAA reads
[Enrico Fermi Institute] 14:27:17
are so far also not usable, so we can't stream to LCF
[Enrico Fermi Institute] 14:27:21
batch nodes. There are two possible solutions here. An XRootD proxy is possible in principle, but we only ever talked about it.
[Enrico Fermi Institute] 14:27:30
I don't think anyone has ever set one up at an LCF.
[Enrico Fermi Institute] 14:27:33
And it's probably too much network traffic to route through a single edge
[Enrico Fermi Institute] 14:27:39
node, no matter how well connected that is, at least at
[Enrico Fermi Institute] 14:27:43
the scales we're talking about here.
[Enrico Fermi Institute] 14:27:47
The other is that you actively manage the storage.
[Enrico Fermi Institute] 14:27:50
So you do your Rucio integration, Globus Online, and then you just have the CMS
[Enrico Fermi Institute] 14:27:57
data management and workflow management stacks work with that location and pre-stage data. And again, at the LCF-type scale,
[Enrico Fermi Institute] 14:28:04
I think you need to actively manage the storage.
[abh] 14:28:06
right, and could I pipe in here just for a second?
[Enrico Fermi Institute] 14:28:09
Yeah.
[abh] 14:28:11
People have used proxies at NERSC. Mind you, the setup there is a little bit easier because they have multiple DTNs, and you can actually use all of the DTNs for the proxy server.
[abh] 14:28:23
So it is possible. But you need a rather fluid setup, like NERSC.
[Enrico Fermi Institute] 14:28:23
Huh!
[Enrico Fermi Institute] 14:28:32
Yeah. As I said, at NERSC it wasn't...
[Enrico Fermi Institute] 14:28:34
I mean, I think the network connectivity is good enough that we don't really need it at the moment.
[Enrico Fermi Institute] 14:28:41
It's not worth the effort yet.
[abh] 14:28:42
Okay.
[Enrico Fermi Institute] 14:28:45
And Perlmutter should be even better. We haven't really scale-tested Perlmutter at that level yet,
[Enrico Fermi Institute] 14:28:52
but from what I saw of how the design has evolved in terms of network integration,
[Enrico Fermi Institute] 14:28:58
and from what he said as well, I expect it to work better going forward.
[Enrico Fermi Institute] 14:29:04
So, you see, the CMS plan was just to not even worry about local storage, and Fermilab doesn't have a Globus Online license.
[Enrico Fermi Institute] 14:29:21
So our plan is that we do multi-hop transfers through NERSC, because NERSC at the moment still has GridFTP, and we're working with them to get the XRootD transfers going. Once that is in place, our plan is to manage the LCF
[Enrico Fermi Institute] 14:29:35
data transfer through NERSC. So everything goes multi-hop through NERSC, so we will need a bit of space there.
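The multi-hop idea amounts to finding a chain of sites in which each consecutive pair shares a transfer protocol. A minimal sketch, under stated assumptions: the protocol sets below are illustrative only (in particular, that the LCF endpoint speaks only Globus is an assumption, not something stated in the discussion), and `find_route` is a hypothetical helper, not part of any data management tool.

```python
from collections import deque

# Illustrative protocol support per endpoint; these sets are assumptions.
PROTOCOLS = {
    "Fermilab": {"xrootd", "gridftp"},
    "NERSC": {"gridftp", "globus"},
    "LCF": {"globus"},
}

def can_transfer(a, b):
    """Two endpoints can exchange data if they share at least one protocol."""
    return bool(PROTOCOLS[a] & PROTOCOLS[b])

def find_route(src, dst):
    """Breadth-first search for the shortest chain of hops from src to dst."""
    queue = deque([[src]])
    seen = {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in PROTOCOLS:
            if nxt not in seen and can_transfer(path[-1], nxt):
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no chain of compatible endpoints exists

route = find_route("Fermilab", "LCF")
print(route)
```

Under these assumed protocol sets there is no direct Fermilab to LCF path, so the search routes through NERSC, which is why buffer space is needed at the intermediate site.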
[Enrico Fermi Institute] 14:29:42
And once that is in place we might start exploring also running actively managed storage there.
[Enrico Fermi Institute] 14:29:49
But we will probably still have a large streaming component. A dumb question, and then we can stop going down the rabbit hole:
[Enrico Fermi Institute] 14:29:54
I assume some of the Tier-2s have Globus licenses.
[Enrico Fermi Institute] 14:30:01
We could route it through that, too.
[Enrico Fermi Institute] 14:30:05
For different.
[Paolo Calafiura (he)] 14:30:11
And just to be sure I understand: by provisioning integration you mean assigning the work to workers, since they cannot reach...
[Enrico Fermi Institute] 14:30:19
Basically, you have work in the system
[Enrico Fermi Institute] 14:30:26
that is assigned to an HPC. Now bring up resources to run that work and route
[Paolo Calafiura (he)] 14:30:32
Yeah, yeah, yeah, understood. Yeah.
[Enrico Fermi Institute] 14:30:33
The work, then
[Enrico Fermi Institute] 14:30:41
So now we have. We have a slide on the security model, strategic conservation and security model.
[Enrico Fermi Institute] 14:30:48
We probably don't need to spend too much time on, because there's a discussion on Wednesday where we hopefully have some security folks from formula.
[Enrico Fermi Institute] 14:30:59
We invited someone, and maybe someone from WLCG as well. But we wanted to discuss some of the strategic things about HPC use, and we already covered some of it:
[Enrico Fermi Institute] 14:31:12
the yearly allocation cycle, which doesn't fit with our resource planning. And so we can't plan with resources that we're not sure we will have.
[Enrico Fermi Institute] 14:31:20
But so far we focused mostly on this: since they don't fit our resource planning cycle, we can't pledge them.
[Enrico Fermi Institute] 14:31:27
We don't get any credit for them, which is mostly a problem, eventually, for the funding agencies.
[Enrico Fermi Institute] 14:31:31
But there's another issue. If we say we are moving into a resource-constrained environment for HL-LHC, it also means that resources that are not pledged, and that we can't plan with, cannot be included as part of our plan, which means our plan has to be artificially downsized to not consider them,
[Enrico Fermi Institute] 14:31:49
which might be a restriction on us.
[Enrico Fermi Institute] 14:31:52
At the moment it's not so much of one, because we have enough resources to cover everything we need to do.
[Enrico Fermi Institute] 14:31:58
But that might not be the case anymore in the HL-LHC environment.
[Enrico Fermi Institute] 14:32:09
I see Eric's hand.
[Eric Lancon] 14:32:12
Yes, I'd like to intervene, because it's not the first time it is said that we cannot pledge.
[Eric Lancon] 14:32:20
I think it's a bit too strong a statement.
[Eric Lancon] 14:32:26
It might be better to say that the experiments, or the WLCG,
[Eric Lancon] 14:32:34
need to evolve towards more dedicated campaigns.
[Eric Lancon] 14:32:42
Because currently we would like to use those HPCs
[Eric Lancon] 14:32:49
as a regular WLCG site, and they are not so very well suited for this.
[Eric Lancon] 14:32:57
You may want to consider that the experiments run the Monte Carlo campaigns a few times in the year, and these campaigns of short duration are exported to those HPCs,
[Eric Lancon] 14:33:12
which have a large capacity. In that case you could consider pledging these resources, because you don't have a flat requirement of CPU across the year from the experiment. You see what I mean?
[Enrico Fermi Institute] 14:33:30
So you want to pledge it for specific purposes.
[Enrico Fermi Institute] 14:33:35
You want to say that this campaign is a pledged campaign on this resource. So that would move away...
[Enrico Fermi Institute] 14:33:42
I think we had that this morning, where we said we want to
[Enrico Fermi Institute] 14:33:46
move away from the universally usable resource pledge,
[Enrico Fermi Institute] 14:33:51
where basically you could target anything at it, to a pledge for a specific purpose.
[Eric Lancon] 14:33:58
Yes. Because why is it that the Monte Carlo is quite flat across the year, to first order?
[Eric Lancon] 14:34:05
It's because there is not enough capacity,
[Eric Lancon] 14:34:08
CPU capacity, to absorb the Monte Carlo simulation within one month.
[Eric Lancon] 14:34:16
One month is just an example. So the operational model should adapt to the type of resources that the experiments want to use.
[Eric Lancon] 14:34:28
Maybe
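The capacity argument above can be made concrete with a small calculation. This is only an illustrative sketch: the campaign size of one billion core-hours is an assumed number that was not given in the discussion, and `cores_needed` is a hypothetical helper. It compares the average core count needed to run the same campaign flat across a year against running it within one month.

```python
HOURS_PER_YEAR = 365 * 24


def cores_needed(campaign_core_hours: float, duration_hours: float) -> float:
    """Average number of cores that must run for the whole duration."""
    if duration_hours <= 0:
        raise ValueError("duration_hours must be positive")
    return campaign_core_hours / duration_hours


# Assumed campaign size, for illustration only (not a number from the meeting).
CAMPAIGN_CORE_HOURS = 1.0e9

# Same work, spread flat over the year versus absorbed in one month.
flat_cores = cores_needed(CAMPAIGN_CORE_HOURS, HOURS_PER_YEAR)
burst_cores = cores_needed(CAMPAIGN_CORE_HOURS, HOURS_PER_YEAR / 12)

print(f"flat over a year : {flat_cores:,.0f} cores")
print(f"within one month : {burst_cores:,.0f} cores")
print(f"ratio            : {burst_cores / flat_cores:.0f}x")
```

Whatever the campaign size, the one-month case needs twelve times the cores of the flat case, which is the kind of short, large burst that an HPC allocation could absorb and a flat pledge does not describe.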
[Enrico Fermi Institute] 14:34:32
Okay, I see hands. Andrew?
[Andrew Melo] 14:34:38
Yeah. So I did want to point out, first off, that there is a meeting.
[Andrew Melo] 14:34:43
The WLCG meeting is planned for November.
[Andrew Melo] 14:34:47
We're actually going to discuss reopening it; the plan is, I guess, at least to some extent to reopen the MoU
[Andrew Melo] 14:34:54
and to discuss things like this. So I don't think that that's going to be stuck there forever.
[Andrew Melo] 14:35:01
And then I think that also, you know, the new HEPscore benchmark is quickly converging, so that for these things we do have a unit that we can use, how do you say, to be able to make a resource request
[Andrew Melo] 14:35:18
and then also pledges in it. I do want to push back a little bit and say that we probably don't want to have the pledging infrastructure be so fine-grained as to say that we are going to request that we get X amount of whatever for a certain amount of time
[Andrew Melo] 14:35:36
the resources. But I do think that the ability to
[Andrew Melo] 14:35:45
put these facilities into the pledge in a holistic way is something that's hopefully going to be coming with the cycle of everything.
[Andrew Melo] 14:35:51
How it works: definitely not 2024, but maybe on like the 2025-26 time scale.
[Andrew Melo] 14:36:03
I think that, like
[Andrew Melo] 14:36:04
I think that, you know, with the benchmarks coming around, we can actually, you know,
[Andrew Melo] 14:36:10
say what we need to quantify what these machines are, and with the, I guess, political idea that we're going to reopen the conversation on the MoU that hasn't been reopened since, you know, whenever it was, I think that this is something that we can hopefully get done in the, you know, in the short
[Andrew Melo] 14:36:25
term
[Enrico Fermi Institute] 14:36:27
Okay.
[Enrico Fermi Institute] 14:36:27
okay.
[Enrico Fermi Institute] 14:36:30
Okay, Simone.
[simonecampana] 14:36:35
Yes, I think there is a bit of confusion. First of all, on the latest topic:
[simonecampana] 14:36:41
if you read the MoU, there is nothing written there that says that an HPC
[simonecampana] 14:36:46
cannot be used as a pledged resource, as simple as that. So one doesn't have to
[simonecampana] 14:36:50
rediscuss the MoU to discuss this. There are HPCs that have been part of the pledges for at least a decade and a half. In the Nordic countries, you know, the Tier-1 provides resources also partially through time on an HPC. So the reality is that the MoU tells
[simonecampana] 14:37:10
you the basic principles of what can be considered a pledged resource. It has to be something with a certain amount of reliability.
[simonecampana] 14:37:18
Availability needs to be accounted for. You need to be able to send a ticket to it, and that's what it says.
[simonecampana] 14:37:22
So I think that, you know, in terms of policy, we don't need a major discussion and a rewrite of the MoU.
[simonecampana] 14:37:35
The work can start today. I think there is something technical to be done, because a lot of what I just mentioned
[simonecampana] 14:37:40
may be a technical detail, but someone still has to do the work of integrating the facility properly.
[Enrico Fermi Institute] 14:37:49
But but
[simonecampana] 14:37:50
The other thing, and this is the comment I made this morning: when you try to define a facility that works for one use case, you have to decide which granularity you want to get. If it
[simonecampana] 14:38:06
is Monte Carlo versus data processing, fine.
[simonecampana] 14:38:09
If it is a second kind of Monte Carlo, a bit less fine. If it is only event generation, because it's the only one that doesn't need an input, it starts becoming really fine-grained. And for those of you who participated in a discussion at the RRB and, you know,
[simonecampana] 14:38:25
all the process that has to do with resource requests, etc.,
[simonecampana] 14:38:31
this becomes very complicated very quickly. So in the end, the risk is that we do a lot of work to pledge HPCs for a benefit that is not particularly measurable.
[Enrico Fermi Institute] 14:38:38
Yeah.
[simonecampana] 14:38:46
I think we are confusing the pledging and the work that those HPCs are doing, and this should be done with the idea that those HPCs are multi-purpose facilities, which today many of them are not. Some of them, if you try to discuss with the Awkward for
[simonecampana] 14:39:03
example, today there is not a lot you can do with a quiz unless you can use all those GPUs.
[simonecampana] 14:39:09
So is that a multi-purpose facility today? It is not. So
[simonecampana] 14:39:11
I think there is a bit of confusion around what is policy,
[Enrico Fermi Institute] 14:39:14
Okay.
[simonecampana] 14:39:16
what is practical, and what needs technical work to be done.
[simonecampana] 14:39:20
So. I think this needs to be organized a bit
[Enrico Fermi Institute] 14:39:25
But even at the policy level, the one example you gave is something that... maybe I should use the words non-WLCG resource, or something like this.
[Enrico Fermi Institute] 14:39:35
But the idea of reliability on something where you're not going to use it
[Enrico Fermi Institute] 14:39:39
9 months of the year, and then you're going to get a burst of, you know, 200,000 cores:
[Enrico Fermi Institute] 14:39:48
policy-wise, I'm not sure that has any translation.
[Enrico Fermi Institute] 14:39:51
I mean, for the sorts of resources we're talking about here,
[Enrico Fermi Institute] 14:39:55
it doesn't fit within the policy framework. That's my concern.
[Enrico Fermi Institute] 14:40:01
If the policy is that it needs to be up 90% of the time, and you need access to a certain base load
[Enrico Fermi Institute] 14:40:09
of cores, rather than a burst once a year: that's not how these things work. So that's why I was saying that we really do need the policy work here as well.
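The mismatch between a flat uptime rule and a burst resource can be sketched numerically. This is a minimal illustration under stated assumptions: the 90% threshold and the roughly three-month usage window come from the discussion, while the two functions, the day-level granularity, and the "two days lost" figure are hypothetical and are not how WLCG availability is actually computed.

```python
def flat_availability(up_days: set, days_in_year: int = 365) -> float:
    """Fraction of the whole year on which the resource was usable."""
    return len(up_days) / days_in_year


def campaign_availability(up_days: set, campaign_days: set) -> float:
    """Fraction of the agreed campaign window on which it was usable."""
    if not campaign_days:
        raise ValueError("campaign window must not be empty")
    return len(up_days & campaign_days) / len(campaign_days)


# Assumed scenario: a 90-day burst allocation with two days lost.
campaign_days = set(range(90))
up_days = set(range(88))

flat = flat_availability(up_days)
scoped = campaign_availability(up_days, campaign_days)

print(f"measured over the year    : {flat:.1%}")
print(f"measured over the campaign: {scoped:.1%}")
print("meets a 90% rule (year)    :", flat >= 0.9)
print("meets a 90% rule (campaign):", scoped >= 0.9)
```

Under these assumptions the same resource fails a year-round 90% rule and passes a campaign-scoped one, which is one way to read the request for policy work on top of the technical integration.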
[simonecampana] 14:40:19
A little bit, but the reality is that a lot of what we care about is that it's not the case that 90% of your jobs fail when you end up there. And this being an HPC
[simonecampana] 14:40:29
or a grid site, I'm sorry, it's a useful thing to ask, right?
[Enrico Fermi Institute] 14:40:36
Yeah, you know, in much the same way that you have, in the power ecosystem, base load and variable demand modes,
[Enrico Fermi Institute] 14:40:47
I think we need to have some more fundamental ideas in the policy framework.
[Enrico Fermi Institute] 14:40:54
You know, right now it's as if our power grid is built from coal, and only coal, and we say that wind can't possibly be counted, and we both, of course, have been successful
[simonecampana] 14:40:59
yeah.
[simonecampana] 14:41:04
I just
[simonecampana] 14:41:07
I understand, Brian, but you realize that the discussion on availability is not the one that today is stopping an HPC from being a pledged resource.
[simonecampana] 14:41:14
Right
[Enrico Fermi Institute] 14:41:16
Let's take a couple more quick comments, and then we can have more discussions about pledging on Wednesday. Yeah, we have a dedicated discussion. Andrew, do you have a quick comment?
[Andrew Melo] 14:41:26
Sorry, my hand is still up, but I'll just quickly point out
[Andrew Melo] 14:41:32
that we can't do this pledging today. It's not that the pledging statute says you can't use HPCs in pledging;
[Andrew Melo] 14:41:41
it's just that, with the rules that are set around how you pledge
[Andrew Melo] 14:41:45
resources, basically, you can't do that. It's not that there's an explicit prohibition against it,
[Andrew Melo] 14:41:52
but you just simply can't do it.
[Enrico Fermi Institute] 14:41:54
yeah.
[Enrico Fermi Institute] 14:41:55
Yeah.
[simonecampana] 14:41:56
I just don't understand this, but fine, I'll let it go.
[simonecampana] 14:41:59
I mean, there are other places where they pledge
[simonecampana] 14:42:02
HPCs, or something like that, right?
[Enrico Fermi Institute] 14:42:02
Yeah, yeah, yeah, but they basically put a grid site on top of it,
[simonecampana] 14:42:07
Well, then, yeah, you have to do some work. Yes, I agree.
[Enrico Fermi Institute] 14:42:07
so, with all the rules. Oh, no, but the problem here is:
[simonecampana] 14:42:10
Yeah.
[Enrico Fermi Institute] 14:42:12
it means that you would have to influence the scheduling of the HPC
[Enrico Fermi Institute] 14:42:18
facility. So the HPC facility itself would have to internally adjust its scheduling policy to match the grid model, at least for a fraction of the site. And that's just not how things are done in the US. We are customers.
[Enrico Fermi Institute] 14:42:33
We don't tell them how they do their scheduling.
[Andrew Melo] 14:42:35
Okay, or let me give another example. Let's say that, you know, today, and I don't know, like, you know, the inside of it,
[Enrico Fermi Institute] 14:42:35
We use the resources as they give them to us
[Andrew Melo] 14:42:41
but, you know, let's say that we're now using Amazon for CMS jobs.
[Andrew Melo] 14:42:46
We can't send site availability, you know, we can't send SAM tests to Amazon right now, so, you know, whatever resource, whatever chunk that Amazon is going to give doesn't show up in the monitoring. Now, it shouldn't be that way, but that's how it
[Andrew Melo] 14:43:02
is.
[Enrico Fermi Institute] 14:43:04
let's
[Enrico Fermi Institute] 14:43:05
Let's take a comment from from Ian, and then let's move on
[Ian Fisk] 14:43:07
My comment was: as I understood it, this was a blueprint meeting, and a blueprint is typically the design for something that you're going to build in the future, which means that I think we need to be a little bit careful when we talk about
[Steven Timm] 14:43:07
good.
[Ian Fisk] 14:43:19
sort of the reality of right now and the limitations that we face right now, and try to be able to see a little bit farther ahead,
[Ian Fisk] 14:43:26
to the times when those limitations will not be there. And so if we want to talk about pledging, maybe we need to sort of define it
[Ian Fisk] 14:43:32
in such a way that it's maybe the ability to run all workflows, or the ability to run some subset of workflows.
[Ian Fisk] 14:43:41
But I think we do ourselves a disservice
[Ian Fisk] 14:43:43
if we expect that nothing's going to change, because I think we will, as a field, along with the rest of science, figure out how to use these machines, and we will figure out how to use clouds.
[Ian Fisk] 14:43:57
And we need to sort of plan for our own success,
[Ian Fisk] 14:43:59
I think
[Enrico Fermi Institute] 14:44:05
So that's a great point
[Enrico Fermi Institute] 14:44:08
Yeah, we already talked quite a bit about the second point.
[Enrico Fermi Institute] 14:44:13
I just wanted to go into it a little bit, because of the one thing that hasn't been brought up yet.
[Enrico Fermi Institute] 14:44:20
So, basically, how we deal with larger architecture changes.
[Enrico Fermi Institute] 14:44:24
We went into that quite a bit already. We've already seen this.
[Enrico Fermi Institute] 14:44:29
Today we see multiple GPU architectures. Basically, the early porting efforts to GPU focused on NVIDIA, because that's what everyone is using. To a large extent,
[Enrico Fermi Institute] 14:44:40
that's still what everyone is using. But if you look at what the LCFs
[Enrico Fermi Institute] 14:44:43
are deploying: Frontier has AMD, and Aurora will have Intel.
[Enrico Fermi Institute] 14:44:52
So what are we doing there? And then the next generation might have some weird FPGA or AI acceleration.
[Enrico Fermi Institute] 14:44:58
Who knows? I know that the framework groups, and this is outside the scope here, are looking at performance portability solutions.
[Enrico Fermi Institute] 14:45:06
So far it looks like yes, you can run everywhere, but you take a severe performance
[Enrico Fermi Institute] 14:45:11
hit. Is that enough? That's an open topic here, but that's the only alternative. If that's not enough,
[Enrico Fermi Institute] 14:45:20
And if this doesn't work, then you kind of have to limit what you can target, because I'm not sure
[Taylor Childers] 14:45:26
Sure. Can I push back on that? You know, the PPS group in HEP-CCE has shown that you can use these frameworks, and sure, you're going to take a performance hit.
[Taylor Childers] 14:45:38
But I would argue 10% is not something that is worth the effort.
[Enrico Fermi Institute] 14:45:41
Okay.
[Enrico Fermi Institute] 14:45:45
If there was a question mark, because maybe maybe it is to rescue
[Taylor Childers] 14:45:45
Especially in the MadGraph case, right?
[Taylor Childers] 14:45:50
I mean, we're running MadGraph with base CUDA, SYCL, Kokkos, Alpaka, and sure, CUDA outperforms.
[Taylor Childers] 14:46:02
But the amount of work that has gone into the CUDA to get another 10%, it's just not worth it.
[Enrico Fermi Institute] 14:46:11
Because I think the 2 options here are, like, given what we have to do in terms...
[Enrico Fermi Institute] 14:46:17
And I know this is outside the scope of the workshop, but it impacts what we can plan with. Basically the only 2 options are either performance portability, or we just don't target a certain architecture, because we cannot just, every 5 years, if an LCF decides they want this new greatest and best
[Enrico Fermi Institute] 14:46:36
accelerator chip, refactor our old software stack. It's just not fun.
[Enrico Fermi Institute] 14:46:44
So
[Enrico Fermi Institute] 14:46:48
Okay. And then in terms of strategic considerations: just because we managed to be able to use this generation's LCF
[Enrico Fermi Institute] 14:46:59
doesn't really guarantee that we can use the next. So we need to keep that in mind when we do the long-term planning, because there might come a point where basically the amount of HPC deployment usable for us goes down and we need to shift that
[Enrico Fermi Institute] 14:47:15
capacity in some way.
[Enrico Fermi Institute] 14:47:21
And then there's a question: does anyone else have any other comment or concern
[Enrico Fermi Institute] 14:47:27
strategically about going all in on, like, making the jump, as Paolo said,
[Enrico Fermi Institute] 14:47:32
on the HPC side, where we can miss the jump?
[Enrico Fermi Institute] 14:47:39
In terms of making the jump, I mean, we can sort of hedge our bets a little bit with that, right? I mean, we don't have to make the jump with 100%
[Enrico Fermi Institute] 14:47:52
of our computing. So, I mean, I mentioned that you don't jump in one go
[Enrico Fermi Institute] 14:48:01
to the top. You make a small jump, you see where you are, and you make another jump.
[Enrico Fermi Institute] 14:48:07
It's a gradual process
[Paolo Calafiura (he)] 14:48:09
One thing I want to say, which I've heard from a reliable source, from some community with multiple jumps, is that the first jump is the worst one.
[Enrico Fermi Institute] 14:48:10
Yeah.
[Paolo Calafiura (he)] 14:48:22
The second and the fourth are increasingly easier. The more you go from one architecture to the other, the less you have to fear about how your code could go from one
[Enrico Fermi Institute] 14:48:40
Yeah, I didn't even mention it here, because I don't think it's a big problem.
[Enrico Fermi Institute] 14:48:44
The multiple CPU architectures: at least I don't see the big issue on the CMS side.
[Enrico Fermi Institute] 14:48:50
That's usually just a recompile and a revalidation.
[Enrico Fermi Institute] 14:48:55
The jump to GPU, and I just, I'm not
[Paolo Calafiura (he)] 14:48:58
No. What I'm saying is that once you jump to GPU, or to, let's say, a parallelization layer, whatever it is, that is a very painful jump.
[Paolo Calafiura (he)] 14:49:09
But once you have done that jump, going from one GPU to another, or from one GPU to some so far unknown architecture which does, you know, matrix multiplications and whatnot, for example going to JAX, may be less painful than the first
[Enrico Fermi Institute] 14:49:11
Just
[Paolo Calafiura (he)] 14:49:27
one. That's what I was trying to say.
[Enrico Fermi Institute] 14:49:35
Okay, we move on. I think we have some presentations. Next, let's go in the class We want to say something on this. I don't think we say anything.
[Enrico Fermi Institute] 14:49:47
On the security model. We'll we'll talk about the security model.
[Enrico Fermi Institute] 14:49:48
Yeah, yeah. So, Andrej, are you connected?
[Enrico Fermi Institute] 14:49:55
Do you want to share? Yeah.
[Andrej Filipcic] 14:49:56
Maybe I share the screen. Can you hear me? Right.
[Andrej Filipcic] 14:49:59
Okay, that's Michelle didn't
[Enrico Fermi Institute] 14:50:02
Great. So we want to show a little bit of what's going on on the European side.
[Andrej Filipcic] 14:50:04
Just
[Enrico Fermi Institute] 14:50:08
Yeah, then we can just as a
[Andrej Filipcic] 14:50:09
Right? So just a bunch of slides. But let me know if you are interested in anything else.
[Enrico Fermi Institute] 14:50:12
Yeah.
[Andrej Filipcic] 14:50:18
or in some specifics over here. So maybe it's a bit too generic.
[Andrej Filipcic] 14:50:21
So the EuroHPC Joint Undertaking is, let's say, a company of 31 states, which are colored here on the right side.
[Andrej Filipcic] 14:50:34
The members are basically all of Europe, and Turkey, apart from the UK and Switzerland.
[Andrej Filipcic] 14:50:39
And in the first phase, which ended last year, there were 8 machines funded.
[Enrico Fermi Institute] 14:50:43
Okay.
[Andrej Filipcic] 14:50:46
So 3 pre-exascale machines in the range of 250 to 350 petaflops.
[Andrej Filipcic] 14:50:51
Those are LUMI in Finland, Leonardo in Italy, which will be inaugurated in November, and MareNostrum, which will be a bit later.
[Andrej Filipcic] 14:51:02
Its procurement just finished, but there are not many details on this machine yet, apart from the talk today: it will have a quite large CPU partition of 30 petaflops,
[Andrej Filipcic] 14:51:14
which is quite good, let's say so. The second phase is the 6 years up to 2027, and among the currently approved machines, the high-range one is the exascale machine, which would be JUPITER. So the machine was just approved, but the procurement was
[Andrej Filipcic] 14:51:35
not yet done, so there are no details on this machine, just the plans, basically.
[Andrej Filipcic] 14:51:42
They want to reach one exaflop with some... okay, that's enough.
[Andrej Filipcic] 14:51:47
And then there will be 4 mid-range ones, so 4 HPCs, with investments between 20 and 40 million euros each, and those will be in Greece,
[Andrej Filipcic] 14:52:02
Hungary, Poland and Ireland. I think there will also be some co-located quantum computers,
[Andrej Filipcic] 14:52:12
so the first generation, and these will be approved probably next month.
[Andrej Filipcic] 14:52:18
I'm skipping. So this is just the mission, which you can read later on.
[Andrej Filipcic] 14:52:24
Basically EuroHPC wants to support leadership supercomputing, including quantum computing, and all the data infrastructure around it.
[Andrej Filipcic] 14:52:35
Then they want to develop their own hardware, and they want to involve industry a lot.
[Andrej Filipcic] 14:52:42
Let's say to bullet. So, the budget: the budget is pretty much 50% from the European Commission and 50% from the hosting states.
[Enrico Fermi Institute] 14:52:46
Okay.
[Andrej Filipcic] 14:52:55
So these are the countries that decide to build the HPC,
[Andrej Filipcic] 14:52:58
although for the smaller machines the European Commission only funds 35%.
[Andrej Filipcic] 14:53:04
So in phase one, the 3 S. 1 billion euros were spent; for phase two, 7 to 8 billion is actually foreseen. In the table on the picture you have a detailed breakdown from the European Commission, and then there would be the same matching contribution from
[Enrico Fermi Institute] 14:53:11
Okay.
[Andrej Filipcic] 14:53:25
all the member states. Okay, and also, let's say, 200 million is meant for hyperconnectivity,
[Andrej Filipcic] 14:53:33
so for a terabit network, and 50% of the money is spent on new product infrastructure.
[Andrej Filipcic] 14:53:42
There are many projects and activities going on around it.
[Andrej Filipcic] 14:53:45
So maybe one important one is EuroCC, or European Competence Centres, which basically is a very large project with 30 participant
[Andrej Filipcic] 14:54:01
states, let's say, so most of them, and the funding is about 1 million euros per country per year.
[Andrej Filipcic] 14:54:07
The goals are basically training and connection with industry, and collecting,
[Andrej Filipcic] 14:54:14
so, the knowledge on HPC, whatever that means. There are also Centres of Excellence, for example, which are mostly dedicated to, let's say, supporting software development or scalability extensions of particular groups.
[Andrej Filipcic] 14:54:28
They can be dedicated to a particular field of science, like chemistry or molecular dynamics or something like that, or they can be a bit wider in scope, for specific,
[Andrej Filipcic] 14:54:38
let's say, data handling for exascale, something like that.
[Andrej Filipcic] 14:54:44
About 10 Centres of Excellence were initially funded, with between 6 and 8 million per project, and those calls will be continuing all the time
[Andrej Filipcic] 14:54:53
through this period. There are 2 bodies, the Research and Innovation Advisory Group and the Infrastructure Advisory Group, which basically form recommendations for the evolution and development and so forth, basically for everything: for the research calls for funding and for infrastructure deployment.
[Andrej Filipcic] 14:55:16
Another part of it is the European Processor Initiative, with the aim to build European CPUs and GPUs.
[Andrej Filipcic] 14:55:24
Of course, maybe he'll be slightly written. That's later.
[Andrej Filipcic] 14:55:30
There's also EUMaster4HPC, which is just a common university program.
[Andrej Filipcic] 14:55:36
So this is a project where many countries and universities,
[Andrej Filipcic] 14:55:42
let's say about 30 of them, will try to put the HPC
[Andrej Filipcic] 14:55:47
master studies basically in sync, and, let's say, share students, share lectures, and so on.
[Andrej Filipcic] 14:55:57
There are about 30 projects altogether. So, the resource allocation: access is only provided to European users,
[Andrej Filipcic] 14:56:07
so basically to members of the European Union, the extended one. Actually, the European Commission share is managed very similarly to PRACE before.
[Andrej Filipcic] 14:56:19
So they place, like, calls for applications, with some changes. The first one is development and benchmarking access,
[Andrej Filipcic] 14:56:26
with basically immediate access, so, let's say, within less than a month, maybe even within 2 weeks. And this is not negligible even in resources.
[Enrico Fermi Institute] 14:56:31
Okay, this.
[Andrej Filipcic] 14:56:37
So you can get something like up to half a million CPU hours
[Andrej Filipcic] 14:56:43
for this access, and you get it for up to a year.
[Andrej Filipcic] 14:56:48
Then there is the regular access, which is a couple of tens of millions of CPU hours.
[Andrej Filipcic] 14:56:53
This is peer reviewed, and there will also be calls in the future for industry and the public sector.
[Andrej Filipcic] 14:57:00
This is not finalized yet, because of the funding issues
[Andrej Filipcic] 14:57:05
and, let's say, charging for the industry. So, the hosting entity share:
[Andrej Filipcic] 14:57:11
so, the owner of the HPC,
[Andrej Filipcic] 14:57:14
the country; the policies there are completely regulated by country policies or decisions.
[Andrej Filipcic] 14:57:21
So each state can do whatever they want with their share.
[Andrej Filipcic] 14:57:28
So overall, the design of some of the HPCs
[Andrej Filipcic] 14:57:37
is quite classical, but not all of them are really classical
[Andrej Filipcic] 14:57:40
HPCs anymore. As you know, Vega in Slovenia was designed with heavy-duty data processing and outbound connectivity in mind, which works actually pretty well
[Andrej Filipcic] 14:57:54
for ATLAS, where Vega contributed something between 30 and 40% of CPU
[Andrej Filipcic] 14:58:01
during the last year, let's say. Then the second one, LUMI:
[Andrej Filipcic] 14:58:03
they have a very large dedicated partition for visualization and services, and they will provide Ceph object storage for long-term data preservation and
[Andrej Filipcic] 14:58:16
so on, and they want to provide all the modern tools.
[Enrico Fermi Institute] 14:58:19
Okay.
[Andrej Filipcic] 14:58:20
MareNostrum 5 I mentioned here. It has not been built yet, but they said that it will have a much larger CPU partition and open access, because the government decided that this machine needs to support LHC.
[Andrej Filipcic] 14:58:35
So this was great already. Overall, in the architecture, most of these machines are general purpose.
[Andrej Filipcic] 14:58:46
Some maybe less general purpose than the others, but basically all of them need to adapt to the user needs.
[Andrej Filipcic] 14:58:54
So it's a bit different: they're not completely free to set the policies,
[Andrej Filipcic] 14:59:00
how these machines will be set up and what services they can provide, because overall the EuroHPC
[Andrej Filipcic] 14:59:07
Governing Board, which is representatives from the states, can say what to do with these machines.
[Andrej Filipcic] 14:59:16
Right. And there are many countries that participate in these calls but don't have an HPC,
[Andrej Filipcic] 14:59:23
but they would like to use it, and basically for all the science.
[Andrej Filipcic] 14:59:27
And also interesting is this stuff: the current machines are a mixture of CPU and GPU partitions.
[Andrej Filipcic] 14:59:37
So the CPUs are mostly AMD, then some Intel; the recent ones, for example, would be Intel.
[Andrej Filipcic] 14:59:45
Then there is one ARM machine that will be in Portugal, based on Fujitsu, and they have both NVIDIA and AMD, but most have NVIDIA GPUs, and some, like LUMI, only have AMD. So LUMI
[Enrico Fermi Institute] 15:00:03
Okay.
[Andrej Filipcic] 15:00:05
is the same as, okay, what's the name, the OLCF one.
[Andrej Filipcic] 15:00:08
Right. But in any case, most nodes have GPUs.
[Andrej Filipcic] 15:00:14
So most of the hardware is GPU; it comprises between 60 and 80%.
[Andrej Filipcic] 15:00:20
It depends on the machine. Well, there's one small machine that is CPU only, but all the big machines have,
[Andrej Filipcic] 15:00:25
let's say, 20 to 40% of CPU nodes, not even CPU power.
[Andrej Filipcic] 15:00:31
Right? Okay, computing power. So the storage is typically Lustre with Ceph, and some also provide some kind of yeahf.
[Andrej Filipcic] 15:00:40
So this one is less popular, and most of these machines, basically apart from LUMI and Karolina, which is in the Czech Republic, were built by Atos. The future machines
[Andrej Filipcic] 15:00:59
will most definitely... So the large one, the next exascale machine, which will be built in France, will be ARM based, so it would be ARM CPU plus GPU as well.
[Andrej Filipcic] 15:01:10
The details are not clear yet. The goal is to build it somewhere in 2024-25, and after that, the next one,
[Andrej Filipcic] 15:01:20
so let's say the next exascale machine, whatever it would be,
[Andrej Filipcic] 15:01:24
they have strong wishes, let's say, for now, that it should be RISC-V based. Next slide.
[Andrej Filipcic] 15:01:33
So some thoughts, some observations after 1.5 years of operations of these machines. Each of them has of the order of something like 500 users, which might seem a bit little to some, but actually most of these users are complete newcomers, since many other users already have allocations on
[Andrej Filipcic] 15:01:57
the large existing machines, let's say, in Italy, Spain, Germany, or France.
[Andrej Filipcic] 15:02:02
so the ones that are part of PRACE. And these users have really a lot of different kinds of workloads, so many-node compute jobs on CPU or GPU, and the majority, I'll say, is chemistry or material
[Andrej Filipcic] 15:02:21
science, although, at least on Vega, there are something like 30 different applications that the users want to run. A lot of users also do small-node or small-core parameter scans with tons of independent jobs, let's say. And many, many users in the last
[Andrej Filipcic] 15:02:43
year started to use machine learning. Even ATLAS users do analysis with machine learning, and this is actually growing rapidly.
[Andrej Filipcic] 15:02:51
Because let's say it's quite simple with Tensorflow and all the case.
[Andrej Filipcic] 15:02:55
So to locates the around and atar machine, at least on Vega we have really big pressure on GPUs, so the next machine we buy will have a much larger GPU partition. Oh, some users also do extreme data processing.
[Enrico Fermi Institute] 15:03:08
Okay.
[Andrej Filipcic] 15:03:14
No I don't mean I let's see here.
[Andrej Filipcic] 15:03:17
But, for example, something like cryo-microscopy or different stuff, where they produce, let's say, a couple of tens of terabytes per measurement that they want to process semi-interactively. Some HPCs allocate full nodes only, but there are many that can run any type
[Andrej Filipcic] 15:03:32
of jobs. Also, we have observed, in my experience, that many users are not quite happy with the default data organization of the HPCs, which basically more or less doesn't exist,
[Andrej Filipcic] 15:03:45
I would say, although we may have other tools in the future. But, let's say, within EuroHPC
[Andrej Filipcic] 15:03:51
the data migration and movement were not yet discussed. And many users stick to containers, and some demand even
[Andrej Filipcic] 15:04:01
virtualization. Basically, here for EuroHPC, what users demand to use should basically be provided sooner or later.
[Andrej Filipcic] 15:04:11
There are many more users on EuroHPC at this point than there ever were in PRACE.
[Andrej Filipcic] 15:04:16
So this number will probably grow, let's say, cumulatively to 50,000 pretty soon, on all the machines. And there are really a lot of newcomers due to the simplicity of access: you basically just submit a proposal, not even a
[Andrej Filipcic] 15:04:33
proposal, an application with a quick description, and you will get access within less than a month.
[Andrej Filipcic] 15:04:39
The usage by industry is rising a bit.
[Andrej Filipcic] 15:04:43
this is mostly small or medium enterprises, but this is still not extremely high.
[Andrej Filipcic] 15:04:50
Let's say, more or less, industry is entitled to use 20% of the HPCs by European law,
[Andrej Filipcic] 15:04:57
Let's say, by European funding regulations, but they're not yet at 20.
[Andrej Filipcic] 15:05:06
Far from 20% of usage at this point, although some HPCs, like the one in Luxembourg, were built entirely to support industry.
[Andrej Filipcic] 15:05:15
Several countries also decided to provide resources through EuroHPC. For 4.
[Andrej Filipcic] 15:05:20
Let's say, Slovenia for sure, because I know what's going on here.
[Andrej Filipcic] 15:05:25
You have the same message from Spain or the others be shared.
[Andrej Filipcic] 15:05:28
And lately, even in Germany. So I think that Germany wants to keep only DESY and KIT. I'm not sure if this is official yet, but other countries will probably follow a similar way.
[Enrico Fermi Institute] 15:05:36
question.
[Andrej Filipcic] 15:05:40
So. The and let's see. European. Sorry. Yes, go ahead.
[Enrico Fermi Institute] 15:05:42
Questioning.
[Enrico Fermi Institute] 15:05:47
So you said several countries have already decided, you know, like Slovenia, with the highly successful Vega.
[Enrico Fermi Institute] 15:05:53
What about the Vega design makes that so much easier to integrate than would be some of the US
[Enrico Fermi Institute] 15:06:06
Snowflakes.
[Andrej Filipcic] 15:06:09
I'll say, because on Vega we need the pressure to support civilian vessels.
[Andrej Filipcic] 15:06:14
So some other, more classical HPCs are hesitant in this respect.
[Andrej Filipcic] 15:06:20
But let's say Vega is not so different in hardware.
[Andrej Filipcic] 15:06:23
architecture than the others, apart from that we really required a large pipe. At the moment this pipe can do 600 gigabits per second to,
[Andrej Filipcic] 15:06:35
let's say, to GÉANT. And this will increase in the future.
[Andrej Filipcic] 15:06:40
So it's mostly a matter of decision, what you allow users to do over there.
[Enrico Fermi Institute] 15:06:49
Okay.
[Andrej Filipcic] 15:06:51
The network connectivity will likely be boosted a lot in the next two to three years, let's say, especially if there's zone.
[Andrej Filipcic] 15:07:01
One or a bit network is seen, don't. There are still some open questions about the funding scheme, and who can do the networking, and so on.
[Andrej Filipcic] 15:07:14
Long-term data storage is not part of the plans, so it's a bit in the wild, but there's high pressure from many communities to use this as well, right?
[Andrej Filipcic] 15:07:25
So the HPCs at this point are not obliged to provide long-term storage.
[Andrej Filipcic] 15:07:29
Let's say, when the HPC is decommissioned, the storage is really likely to be decommissioned as well, and new storage will be brought up with the new machine, right?
[Andrej Filipcic] 15:07:39
But this will need to change in the future. One thing worth stressing is that some leadership European projects, like Destination Earth,
[Andrej Filipcic] 15:07:49
well, I'm not sure you know it, but Destination Earth is basically ESA plus
[Andrej Filipcic] 15:07:54
ECMWF, the weather agency, and EUMETSAT.
[Andrej Filipcic] 15:07:59
The idea is to provide a digital twin for Earth, right?
[Enrico Fermi Institute] 15:07:59
Hmm.
[Andrej Filipcic] 15:08:06
which includes satellite imaging weather collection.
[Andrej Filipcic] 15:08:12
weather data collection, weather forecasting, and so on, and basically do a global model of our team predictions and so on.
[Andrej Filipcic] 15:08:23
Basically, it's a huge project. And this organization already officially asked the Joint Undertaking if they could use EuroHPC at the production
[Andrej Filipcic] 15:08:33
level, and basically the Joint Undertaking agreed. For now they can use 10% of all the resources, right,
[Andrej Filipcic] 15:08:43
of the European Commission. But up to 10%, and more organizations are to follow this way. For example, Destination Earth doesn't have enough funding or money to do anything without EuroHPC
[Andrej Filipcic] 15:08:57
at this point. So more projects like this will follow, and maybe even let's see Good.
[Andrej Filipcic] 15:09:07
But this was not discussed yet. I will skip the next slides, because I just a bit of an overview for computer as you can.
[Enrico Fermi Institute] 15:09:10
Okay.
[Andrej Filipcic] 15:09:15
You will see them later on when I upload them
[Andrej Filipcic] 15:09:20
Okay, That's it.
[Enrico Fermi Institute] 15:09:23
Great. Thank you. We have some raised hands. So, Paolo, his hand has been up for a while. Go ahead.
[Paolo Calafiura (he)] 15:09:30
Yeah, do I remember? Oh, yeah. I saw one slide in which you mentioned that short term, let's say, the next generation would be Arm, and the next-next generation may be RISC-V. And I'm wondering if you meant that as a CPU replacement, and therefore
[Andrej Filipcic] 15:09:46
Right.
[Paolo Calafiura (he)] 15:09:53
also having accelerators. Or are you just saying it will be pure Arm,
[Paolo Calafiura (he)] 15:09:57
or pure RISC-V?
[Andrej Filipcic] 15:09:58
No, Arm will have accelerators.
[Paolo Calafiura (he)] 15:10:00
Okay, okay, So like, something like.
[Andrej Filipcic] 15:10:05
Yeah, something. Yeah, it's not clear whether it will be Grace Hopper style, or separate chips, or whatever.
[Paolo Calafiura (he)] 15:10:12
Okay, okay.
[Enrico Fermi Institute] 15:10:16
Okay, there's a hand up for you.
[Ian Fisk] 15:10:18
Yeah, my question was, I have two actually. One was about the Destination Earth project.
[Ian Fisk] 15:10:24
Is that a strategic alliance between EuroHPC
[Ian Fisk] 15:10:26
and the project? And is it multi-year? Is it different from the typical peer review?
[Ian Fisk] 15:10:31
Oh!
[Andrej Filipcic] 15:10:31
Yes, it's completely different, because this is a long term project for at least 10 years, And even so, so, it's exactly like So no.
[Andrej Filipcic] 15:10:41
Let me see, let's say.
[Ian Fisk] 15:10:42
Okay, so, is the door open to other multi-year things like that?
[Ian Fisk] 15:10:50
LCG, LHC, negotiating such an arrangement?
[Andrej Filipcic] 15:10:53
I think so. I mean, the thing is that the European Commission needs to find such projects to be of interest in order to support them. And actually those projects are typically listed in the ESFRI table, where, for example, High-Luminosity LHC is, right?
[Ian Fisk] 15:11:10
Okay, was there. I I may have missed it.
[simonecampana] 15:11:13
I think
[Ian Fisk] 15:11:17
But is there a second exascale machine in France someplace?
[Ian Fisk] 15:11:22
I I thought the only one was in Germany. Just understand.
[Andrej Filipcic] 15:11:23
The official one that was accepted already, so it's approved, is in Germany, JUPITER. France will likely come next year.
[Enrico Fermi Institute] 15:11:30
Okay.
[Ian Fisk] 15:11:32
Okay, nice okay.
[Andrej Filipcic] 15:11:33
I mean the call for proposing.
[Ian Fisk] 15:11:39
Thanks.
[Enrico Fermi Institute] 15:11:41
and from Maria
[Maria Girone] 15:11:43
maybe just I want to say that, hey, research infrastructure like, Okay, yay is, There's a lot of ongoing discussions.
[Maria Girone] 15:12:01
As Andrej knows well, between, let's say, the larger communities and EuroHPC,
[Maria Girone] 15:12:09
in order to try to motivate further collaborations, very much like programs such as this Destination Earth, which indeed is a priority for the European Commission.
[Maria Girone] 15:12:21
But we also have a number of projects now that will allow us to do
[Maria Girone] 15:12:28
R&D. Some, I think we will, for instance,
[Maria Girone] 15:12:35
we will present tomorrow, indeed, what we're doing with the Jülich Supercomputing Centre concerning the development and use of GPU resources at scale for distributed training. There is in the pipeline also a European project that will allow us to evaluate open-source
[Maria Girone] 15:13:00
solutions like RISC-V and SYCL. So there are a number of opportunities, and, as Andrej said very well, it is very easy
[Maria Girone] 15:13:09
actually to work on the development side with EuroHPC.
[Maria Girone] 15:13:13
And we get granted the resources for developers
[Maria Girone] 15:13:20
even within 5 days, I mean, within a working week. So it's a very, very nice collaboration, at least at this level. We need to build on this and go further, and that is less obvious.
[Maria Girone] 15:13:33
And it will require some common actions, let's say, at least when we are talking to EuroHPC.
[Enrico Fermi Institute] 15:13:47
Simone was first. Okay.
[simonecampana] 15:13:49
It's a follow-up on Ian's question. I think one of the requisites for entering one of those special programs, like, I know,
[simonecampana] 15:14:02
I don't remember how they're called, but the ones that are not grant-based, the more long-term ones,
[simonecampana] 15:14:06
is, first, that you are an impactful science. And of course, you know, it's really arbitrary to define what is impactful.
[simonecampana] 15:14:17
But of course, the one who saves the Earth has a simpler way of demonstrating that impact. Also, if we want to apply for something like this, I think it's important to make a lot of progress in the software area,
[simonecampana] 15:14:32
because one of the other things one has to demonstrate is that you use an HPC
[simonecampana] 15:14:38
for the value of an HPC. And we already don't use much of the interconnect that an HPC offers.
[simonecampana] 15:14:45
So if we are also cheap on GPUs and the use of those architectures, then we become not such a great candidate for one of those programs.
[simonecampana] 15:14:55
So I think we have to build our story, and we have some technical improvements at the software level that allow us to build a better story. And my question is if there is something like this in the US,
[simonecampana] 15:15:10
because if there was, then we could try to build an even more coherent story across
[simonecampana] 15:15:16
Europe and the US.
[simonecampana] 15:15:22
Do you have a notion of sciences that get into a program once, and then there is a multi-year engagement with the HPC facilities?
[Ian Fisk] 15:15:31
Oh, I think Dirk might be able to answer better. But I think this is one of the things that, in the US,
[Ian Fisk] 15:15:37
there is definitely a push these days from the US
[Ian Fisk] 15:15:40
funding agencies for science to make effective use of the large-scale computing facilities.
[Ian Fisk] 15:15:48
And so, there's not a program like EuroHPC,
[Ian Fisk] 15:15:53
because it's only one country, but it does mean that, if you look at where the national labs have made investments,
[Ian Fisk] 15:16:03
a lot of the investments have been made in centralized facilities, with the expectation that the calculations are done
[Ian Fisk] 15:16:07
there.
[simonecampana] 15:16:08
Right.
[Enrico Fermi Institute] 15:16:09
Yeah, the thing is, I mean, at the moment there are very high-level discussions going on, because there's the push from the funding agencies that we should use more HPC. And it's not just us, it's in general: because they pay for these facilities, they want them
[Enrico Fermi Institute] 15:16:25
to be used. But there now is a pushback, and that's why the conversation is at a very high level.
[Ian Fisk] 15:16:27
Good.
[Enrico Fermi Institute] 15:16:32
Lis mentioned something, that there are groups talking about what they need to do in terms of changing their policies to actually allow that, because the application process for the deed program, the ALCC application process, the INCITE application process, is just not geared towards these use cases. It is geared towards competitive proposals that are unique, that can only be done there, which
[Enrico Fermi Institute] 15:16:56
is just not a good match. And, like I said, this is above our pay scale
[Enrico Fermi Institute] 15:17:01
here. These conversations are going on. Hopefully something comes out of it, we'll see.
[Taylor Childers] 15:17:05
not sure that that is the case. I mean the so.
[Taylor Childers] 15:17:10
The INCITE program offers the opportunity to get up to 3 years of allocation through a competitive review process.
[Taylor Childers] 15:17:19
The challenge is laying out your case, and I would argue that the way you approach this for the leadership computing facilities is that you have to play to their mission, right?
[Taylor Childers] 15:17:34
I mean, their mission is to provide the biggest computers, because people need them, not because Hi good
[Enrico Fermi Institute] 15:17:40
But Taylor, this is, you basically have to sell it, and you have to sell it in their way, in that you basically dress it up as something that
[Enrico Fermi Institute] 15:17:48
You can only do there, and that's not what we want.
[Taylor Childers] 15:17:52
I agree, but I would argue that you can easily make the case based on the fact that you are reaching great scenario, and if you don't get access to the machines, then you'll be slower in your science achievements, and I
[Enrico Fermi Institute] 15:17:52
Thanks.
[Taylor Childers] 15:18:14
think that's a viable argument. I think the part where you guys have trouble, especially in an INCITE program proposal, is the fact that you don't have enough of the workloads that take advantage of GPUs, right?
[Enrico Fermi Institute] 15:18:31
Yeah, that's probably why we're trying ALCC right now.
[Taylor Childers] 15:18:31
I mean the challenges is
[Taylor Childers] 15:18:35
Yeah, for sure and
[Enrico Fermi Institute] 15:18:35
That's easier to justify. I think, if we ever get to the point that you could say, okay, if we get, like, a huge INCITE proposal, you can make the science use case that you can do something you couldn't otherwise do, because it basically adds 50% of your own capacity, or
[Enrico Fermi Institute] 15:18:50
whatever. But then the logistics kick in: INCITE is still only, you,
[Enrico Fermi Institute] 15:18:55
you do an allocation proposal, and you get the decision, and then you get it, like, 3 months later, a few months later. It's too short a time scale.
[Enrico Fermi Institute] 15:19:05
You would basically have to ask a year or two in advance to fit our planning process within the experiment, because you can't just drop that on top of CMS
[Enrico Fermi Institute] 15:19:14
and expect that we basically throw our plans out the window
[Enrico Fermi Institute] 15:19:18
and now effectively use
[Taylor Childers] 15:19:19
Yeah, no, there definitely need to be more discussions above our pay rate.
[Taylor Childers] 15:19:25
I mean, the challenge there is, to some extent, you have to change how the leadership computing facilities are reviewed so that we can accommodate stuff like that.
[Enrico Fermi Institute] 15:19:44
So, I'm sorry, you mentioned that obviously these machines are mixed CPU/GPU.
[Enrico Fermi Institute] 15:19:55
In the next generation, will more of the FLOPS and actual compute power be in the accelerator realm, or will there be some machines where Arm sort of provides the heavy lifting?
[Andrej Filipcic] 15:20:14
Well, how to say, it's hard to predict. But in my opinion, there will always be machines built in such a way that as many user communities as possible can use them.
[Andrej Filipcic] 15:20:24
And several sites know that already, right? So nobody will go to a completely dedicated machine. For example, even for JUPITER, which is exascale, it would be easier to build it up and reach the highest Top500 number by going GPU only, right, but they don't
[Andrej Filipcic] 15:20:45
want that. I mean, nobody would actually want that. So, on CPUs,
[Andrej Filipcic] 15:20:53
it depends, right. Still, quite many users are used to x86, right?
[Andrej Filipcic] 15:21:00
But Arm is not so difficult in this respect if you use the CPU-only part. When you have GPUs it will be slightly different, but Arm will definitely be a larger player in the next couple of years, or something like that.
[Enrico Fermi Institute] 15:21:14
But my takeaway from what you've just said is that, at least for the next generation, they're likely to have a significant CPU footprint, because they're sort of mandated to be as usable as possible to the broader scientific and such communities.
[Andrej Filipcic] 15:21:35
Right yup
[Enrico Fermi Institute] 15:21:36
Right, okay, thanks. Okay, let's move on. So, are there any other questions for Andrej?
[Enrico Fermi Institute] 15:21:48
I think we should move on. Hey? Thank you.
[Andrej Filipcic] 15:21:50
Well, welcome!
[Enrico Fermi Institute] 15:21:52
Okay, we have a couple of slides from some European CMS
[Enrico Fermi Institute] 15:21:57
Efforts, recipes. Daniela isn't connected, I think, unless if he's here, you should speak up.
[Enrico Fermi Institute] 15:22:04
He told me he couldn't. So this
[Enrico Fermi Institute] 15:22:08
is integration, basically, of CINECA at the CNAF Tier-1.
[Enrico Fermi Institute] 15:22:13
So they are co-located, sitting in the same data center.
[Enrico Fermi Institute] 15:22:18
There is the CINECA Marconi100 HPC,
[Enrico Fermi Institute] 15:22:21
which is basically a clone, in terms of system architecture, of Summit.
[Enrico Fermi Institute] 15:22:26
So it's Power plus NVIDIA, and they integrated it as a sub-site of the Tier-1.
[Enrico Fermi Institute] 15:22:32
So since they're co-located in the same data center, they have a really fast network interconnect.
[Enrico Fermi Institute] 15:22:37
They tie it together. The HPC can see basically the CNAF T1
[Enrico Fermi Institute] 15:22:42
storage system. The services are provided by the data center, and they run it as a sub-site of the T1.
[Enrico Fermi Institute] 15:22:52
So CMS operations only sees the T1, and then internally, via some pilot customizations, they can select which parts of the workflows that are sent to the T1 can run on the HPC
[Enrico Fermi Institute] 15:23:05
side. And basically, where we are today, it says almost complete.
[Enrico Fermi Institute] 15:23:09
I think it is complete now, because the announcement came out after the slide was sent to me.
[Enrico Fermi Institute] 15:23:15
You see some slides on how it's integrated, and so on.
[Enrico Fermi Institute] 15:23:18
You see it in the monitoring. The sub-site concept has some unique challenges in how you monitor it.
[Enrico Fermi Institute] 15:23:27
Good.
[Paolo Calafiura (he)] 15:23:27
I'm sorry. Are we looking at slides? Because we these slides
[Enrico Fermi Institute] 15:23:31
Ah, I forgot to re-share it. Oh, that is, yeah, I'll bring it back up. Because in the U.
[Paolo Calafiura (he)] 15:23:33
Alright.
[Enrico Fermi Institute] 15:23:41
S., all the HPC sites they're using,
[Enrico Fermi Institute] 15:23:44
we basically put the concept of a T3 grid site on top of it, which makes the monitoring and accounting and so on really easy, because everything is reported and there's a unique site. If you have a sub-site, it is a little bit more difficult, because everything is kind of hidden under the umbrella of
[Enrico Fermi Institute] 15:24:02
the T1, and then you have to kind of dig into some subfields to identify it. And there has been some work ongoing on the monitoring side in CMS to make that easier. Doesn't
[Enrico Fermi Institute] 15:24:16
this model make it easier to accommodate? It makes perfect sense, I mean, for them.
[Enrico Fermi Institute] 15:24:25
It's great because they're I mean, they're co-located anyways.
[Enrico Fermi Institute] 15:24:29
It makes perfect sense for them. It's a bit more difficult if you're, like, geographically or organizationally separate entities.
[Enrico Fermi Institute] 15:24:40
So in the US it's kind of difficult, because the HPCs are usually standalone. So what is it that has changed between a few years ago and now
[Enrico Fermi Institute] 15:24:51
with regard to CVMFS? It seems like initially people were very, very wary of it: "You don't want to put this on our HPC
[Enrico Fermi Institute] 15:24:58
because it'll crash everything", or whatever. I mean, is it that the technology has gotten better?
[Enrico Fermi Institute] 15:25:02
Or is it that people have gotten less afraid of it? Maybe people have gotten less afraid of it; they just became familiar with it. Also, a lot of people are using it, not just us. That
[Enrico Fermi Institute] 15:25:12
helps. And then I don't worry about it anymore, because any recent machine with a recent OS has no problem running cvmfsexec. Yeah.
[Enrico Fermi Institute] 15:25:24
It just built my own, and I mean, from the OSG side, when we bring on new sites, we only use
[Enrico Fermi Institute] 15:25:31
cvmfsexec, and only if they directly ask, "Can I please run
[Enrico Fermi Institute] 15:25:37
CVMFS?", or if they have any other problems, do we give them the option. Why have that conversation? Sure somebody's not hitched into it?
[Enrico Fermi Institute] 15:25:53
Okay, even on the LCFs, no problem. It worked on Theta out of the box.
[Enrico Fermi Institute] 15:25:58
Basically, it worked on Summit out of the box, so I didn't know of the issues.
[Enrico Fermi Institute] 15:26:01
I'd have to go set up a squid there so I can actually run it on the batch nodes.
[Enrico Fermi Institute] 15:26:05
But it worked on the login node, which runs the same operating system.
[Enrico Fermi Institute] 15:26:10
And then, Antonio, are you connected? Okay, so Antonio can say a few words on what we're doing at MareNostrum.
[Antonio Perez-Calero Yzquierdo] 15:26:12
Hi! Yes, I am! Can you hear me?
[Antonio Perez-Calero Yzquierdo] 15:26:17
Yeah, okay. So, yeah, MareNostrum 4 is the current supercomputer at BSC,
[Antonio Perez-Calero Yzquierdo] 15:26:26
and this is the largest HPC center in Spain.
[Antonio Perez-Calero Yzquierdo] 15:26:29
MareNostrum 5 is planned; it is actually in the procurement phase, as explained before. So we are accessing BSC and MareNostrum as a project mediated by PIC, so that's the WLCG
[Antonio Perez-Calero Yzquierdo] 15:26:48
Spanish Tier-1. And fortunately, let's say, interestingly, LHC
[Antonio Perez-Calero Yzquierdo] 15:26:54
computing has been designated as a strategic project in the BSC program.
[Antonio Perez-Calero Yzquierdo] 15:26:58
So this means that, well, we still have to request the allocation,
[Antonio Perez-Calero Yzquierdo] 15:27:02
but we are getting quarterly grants of about 6 or 7 million hours
[Antonio Perez-Calero Yzquierdo] 15:27:09
available for CMS. And I think it's about the same amount for ATLAS.
[Antonio Perez-Calero Yzquierdo] 15:27:16
So we are getting these allocations. Let's say, regularly.
[Antonio Perez-Calero Yzquierdo] 15:27:19
Okay. However, the case is very difficult for CMS.
[Antonio Perez-Calero Yzquierdo] 15:27:24
The environment is extremely challenging, because, well, for security reasons, no incoming or outgoing connectivity is allowed on the compute nodes.
[Antonio Perez-Calero Yzquierdo] 15:27:36
This means that, well, everything that needs to happen for CMS to run a job, like what I have on the right-hand side: being connected to the workload management, being able to access the software, of course conditions data, and finally access to storage, all these things
[Antonio Perez-Calero Yzquierdo] 15:27:54
are, yeah, cut. Basically, all these connections. We have even recently been discussing the possibility of having some edge services,
[Antonio Perez-Calero Yzquierdo] 15:28:07
and not even this is allowed.
[Antonio Perez-Calero Yzquierdo] 15:28:11
So of course, it's a showstopper for CMS, as tasks require
[Antonio Perez-Calero Yzquierdo] 15:28:15
external services such as the ones I described.
[Antonio Perez-Calero Yzquierdo] 15:28:20
What we have is a login node which allows SSH, and a shared file system mounted on the execute nodes.
[Antonio Perez-Calero Yzquierdo] 15:28:29
And yeah, we can access this distributed file system via SSHFS. So what we are doing is, well, using these capabilities to build the model that you can see in the next slide, which requires a substantial amount of integration
[Antonio Perez-Calero Yzquierdo] 15:28:49
work. Yeah, so the components that we have, let's say, in our favor to make this thing work: first of all is the HTCondor split starter.
[Antonio Perez-Calero Yzquierdo] 15:28:57
So it uses the shared file system as a communication layer for the job
[Enrico Fermi Institute] 15:29:02
Yes.
[Antonio Perez-Calero Yzquierdo] 15:29:02
management. Well, you can see, yes, A, B, C and D
[Antonio Perez-Calero Yzquierdo] 15:29:10
in the diagram below, where basically HTCondor is, well, communicating between the startd and the actual starter where the
[Antonio Perez-Calero Yzquierdo] 15:29:19
job runs, let's say, communicating by passing files through the file system.
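The idea of using a shared file system as the communication layer can be illustrated with a small sketch. This is not HTCondor's split-starter protocol; it is a minimal, hypothetical example that assumes two sides which can only exchange files. The directory names (`requests`, `results`), the JSON message layout, and the function names are all assumptions made for illustration. Messages are written to a temporary name and renamed, so the reader never picks up a half-written file.

```python
import json
import os
import tempfile
from pathlib import Path


def post_message(spool: Path, name: str, payload: dict) -> Path:
    """Write a message atomically so a reader never sees a partial file."""
    spool.mkdir(parents=True, exist_ok=True)
    final = spool / f"{name}.json"
    tmp = spool / f".{name}.json.tmp"
    tmp.write_text(json.dumps(payload))
    os.replace(tmp, final)  # atomic rename within one file system
    return final


def collect_messages(spool: Path) -> dict:
    """Read and remove every complete message currently in the spool."""
    messages = {}
    for path in sorted(spool.glob("*.json")):
        messages[path.stem] = json.loads(path.read_text())
        path.unlink()
    return messages


def run_demo() -> dict:
    """Simulate both sides in one process, using a temporary directory."""
    with tempfile.TemporaryDirectory() as shared:
        requests = Path(shared) / "requests"
        results = Path(shared) / "results"
        # Outside side: hand over a (hypothetical) job description as a file.
        post_message(requests, "job-001", {"cmd": "echo", "args": ["hello"]})
        # Compute-node side: pick up requests, "run" them, post results.
        for job_id, job in collect_messages(requests).items():
            output = " ".join(job["args"])
            post_message(results, job_id, {"exit_code": 0, "stdout": output})
        # Outside side: collect the results from the shared directory.
        return collect_messages(results)


if __name__ == "__main__":
    print(run_demo())
```

In a real deployment the two sides would be separate processes polling the shared directory, and the rename-based handoff is what keeps them from reading each other's partial writes.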
[Antonio Perez-Calero Yzquierdo] 15:29:23
Okay, then for software, what we do is basically replicate the CVMFS repositories at BSC: we get what we need at PIC,
[Antonio Perez-Calero Yzquierdo] 15:29:34
and then basically send the files and rebuild the environment at BSC.
[Antonio Perez-Calero Yzquierdo] 15:29:40
For the conditions data, we cannot access remote databases.
[Antonio Perez-Calero Yzquierdo] 15:29:46
We have to pre-fetch those conditions, make them into files, and pre-place them at BSC.
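The pre-fetch step described here can be sketched as follows. This is not the CMS conditions tooling; it is a hypothetical illustration that assumes conditions can be looked up by a tag name and a run number. The `fetch` callable stands in for the remote database query, and the file layout and function names are assumptions. The point is that the query happens where the network is available, and the job later reads only a local file.

```python
import json
import tempfile
from pathlib import Path


def prefetch_conditions(fetch, tags, run: int, out_file: Path) -> None:
    """Query each tag once (where the database is reachable) and dump to a file."""
    snapshot = {tag: fetch(tag, run) for tag in tags}
    out_file.write_text(json.dumps({"run": run, "payloads": snapshot}))


def load_condition(snapshot_file: Path, tag: str):
    """Read one payload from the pre-placed file; no network access needed."""
    data = json.loads(snapshot_file.read_text())
    return data["payloads"][tag]


def fake_remote_db(tag: str, run: int) -> dict:
    """Stand-in for a remote conditions database (purely illustrative)."""
    return {"tag": tag, "valid_from_run": run, "value": len(tag)}


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        snapshot = Path(workdir) / "conditions_run1000.json"
        # Step 1, outside the HPC: fetch and write the snapshot file.
        prefetch_conditions(fake_remote_db, ["alignment", "beamspot"], 1000, snapshot)
        # Step 2, on the compute node: read from the file only.
        print(load_condition(snapshot, "beamspot"))
```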
[Antonio Perez-Calero Yzquierdo] 15:29:51
And finally, for storage, we have developed our own service for input and output data transfers, initially for output, for the stage-out, let's say. Now we are also commissioning this for, like,
[Antonio Perez-Calero Yzquierdo] 15:30:08
that. So it's kind of quite convoluted.
[Antonio Perez-Calero Yzquierdo] 15:30:12
You can see the two extremes of the system on the diagram: CERN, of course, the CMS workload management system, the storage, etc.,
[Antonio Perez-Calero Yzquierdo] 15:30:20
and, on the other hand, BSC, and how we have to build all this intermediate layer at
[Antonio Perez-Calero Yzquierdo] 15:30:27
PIC, this bridge. Okay, next, please. Yeah, so what's the current status?
[Antonio Perez-Calero Yzquierdo] 15:30:35
Okay, the system works: the services and infrastructure that we have deployed.
[Antonio Perez-Calero Yzquierdo] 15:30:41
This has already allowed us to run a test at a very reasonable scale,
[Antonio Perez-Calero Yzquierdo] 15:30:47
15,006 Cpu cores in in modern option 5.
[Antonio Perez-Calero Yzquierdo] 15:30:51
These are realistic CMS jobs, and this is with an integrated,
[Antonio Perez-Calero Yzquierdo] 15:30:56
well, an aggregated output rate of 500 megabytes per second.
[Antonio Perez-Calero Yzquierdo] 15:31:00
Okay, so it's capable of sustaining such rates.
[Enrico Fermi Institute] 15:31:03
Yeah.
[Antonio Perez-Calero Yzquierdo] 15:31:03
So the stage-out works, it is commissioned and ready, as I was mentioning.
[Antonio Perez-Calero Yzquierdo] 15:31:12
The problem, let's say, now is actually in discussing which CMS workloads can fit into this model
[Antonio Perez-Calero Yzquierdo] 15:31:21
and within the constraints that I explained before. So what I would call realistic CMS workloads so far, in these tests, are GEN-SIM task chain jobs,
[Antonio Perez-Calero Yzquierdo] 15:31:32
for example, in this case, minimum bias production. So it means there is no access,
[Antonio Perez-Calero Yzquierdo] 15:31:38
okay, there is no input data. A full simulation, however, in the style that CMS mostly performs, is in the form of a step chain.
[Antonio Perez-Calero Yzquierdo] 15:31:49
So it's a single HTCondor job running all four stages, GEN, SIM, DIGI, RECO, where in two of these stages the pileup libraries are accessed via AAA.
[Antonio Perez-Calero Yzquierdo] 15:32:04
Okay, so we cannot have AAA. So what we could do in order to be able to run this full step chain is to copy the premix data samples into BSC.
[Antonio Perez-Calero Yzquierdo] 15:32:15
We have, let's say, asked about this possibility.
[Antonio Perez-Calero Yzquierdo] 15:32:19
But, okay, copying datasets of the size of about a petabyte
[Antonio Perez-Calero Yzquierdo] 15:32:28
is not currently allowed. There's no capacity in the current MareNostrum for that. Perhaps in MareNostrum
[Antonio Perez-Calero Yzquierdo] 15:32:35
5, but not at present. Okay, so that rules out this type of full simulation
[Antonio Perez-Calero Yzquierdo] 15:32:41
workflow, let's say. And then what we are doing right now is commissioning this stage,
[Antonio Perez-Calero Yzquierdo] 15:32:48
the stage-in, right. So, this customized data transfer service, in order to push files from PIC storage, for example, but even what we could get through AAA into PIC, and then into BSC,
[Antonio Perez-Calero Yzquierdo] 15:33:02
in order to enable running workflows which require input data.
[Antonio Perez-Calero Yzquierdo] 15:33:05
For example, we are thinking of participating in, or enabling, broader reprocessing at MareNostrum.
[Antonio Perez-Calero Yzquierdo] 15:33:14
And this is the current situation. It's not only, okay,
[Antonio Perez-Calero Yzquierdo] 15:33:19
let's say, in relation to many things that have been discussed so far
[Antonio Perez-Calero Yzquierdo] 15:33:23
in this workshop, it's not only the capabilities that we are allowed, or actually that we're not allowed, to have at BSC, together with how CMS operates. For example, step chains are preferred over task chains, right? So this already restricts
[Antonio Perez-Calero Yzquierdo] 15:33:45
very much what we can do at BSC. I think that's it.
[Enrico Fermi Institute] 15:33:55
I just wanted to make a comment. What Antonio showed, the split-starter method, this HTCondor integration:
[Enrico Fermi Institute] 15:34:01
that's actually what we did, what we used for the LCF,
[Enrico Fermi Institute] 15:34:06
the Theta integration, the prototype integration that we used during the 2020-21 ALCC.
[Antonio Perez-Calero Yzquierdo] 15:34:10
Yeah.
[Enrico Fermi Institute] 15:34:14
It worked there, too. It's even a little simpler there, since you do have edge services, so you can call out from the edge.
[Enrico Fermi Institute] 15:34:22
So certain things are not quite as complicated as at BSC.
[Enrico Fermi Institute] 15:34:25
But we followed the same general integration principle.
[Antonio Perez-Calero Yzquierdo] 15:34:28
Yeah, our case, I don't know, I would say it's particularly interesting because we are really being asked and forced, right?
[Antonio Perez-Calero Yzquierdo] 15:34:37
We have been asked and forced to use MareNostrum, from the funding agency point of view.
[Antonio Perez-Calero Yzquierdo] 15:34:44
Right, I mean, we have the notion that CPU is going to be cut in upcoming funding requests, let's say, for our LHC computing projects. But on the other hand, BSC is not very friendly in terms of allowing things that will
[Antonio Perez-Calero Yzquierdo] 15:35:06
make the integration easier.
[Enrico Fermi Institute] 15:35:07
And the funding agency has no way to influence BSC?
[Enrico Fermi Institute] 15:35:11
They can just say No, we don't
[Antonio Perez-Calero Yzquierdo] 15:35:13
Yeah, it's kind of, I don't know,
[Antonio Perez-Calero Yzquierdo] 15:35:15
I see it as kind of paradoxical, because really we're kind of trapped between two forces squeezing us in the middle.
[Antonio Perez-Calero Yzquierdo] 15:35:23
Right? So yeah, it's making it quite a lengthy and arduous project to integrate this.
[Antonio Perez-Calero Yzquierdo] 15:35:32
Well, we are advancing. We are trying actually to make it as universal as possible.
[Antonio Perez-Calero Yzquierdo] 15:35:37
let's say, in relation to CMS workflows, because otherwise
[Antonio Perez-Calero Yzquierdo] 15:35:44
we will not be able to use the resource. But again, it's difficult.
[Enrico Fermi Institute] 15:35:50
Okay, any other questions, comments?
[Ian Fisk] 15:35:56
I had one, which is sort of to Antonio, and sort of, I think, to the larger group, which is: do we
[Ian Fisk] 15:36:04
want to take advantage of sort of the WLCG
[Enrico Fermi Institute] 15:36:06
It's
[Ian Fisk] 15:36:09
and sort of the larger organizational structures that we have, to basically say that network connectivity to the outside is necessary to work?
[Ian Fisk] 15:36:21
I think it's really very impressive technical work to be able to go around this.
[Ian Fisk] 15:36:25
But this is something that we could sort of, like, I wonder if there'd be any benefit in sort of pushing from above.
[Antonio Perez-Calero Yzquierdo] 15:36:34
Yeah, I'm not usually involved in the political discussions,
[Antonio Perez-Calero Yzquierdo] 15:36:40
so I couldn't tell myself. I don't know if Simone, for example, would...
[Enrico Fermi Institute] 15:36:47
I mean, the one thing, Antonio: you said that they want to reduce your funding for grid computing and replace it with HPC.
[Enrico Fermi Institute] 15:36:55
I mean, at that point I think they expect that the allocated HPC capacity kind of counts as a replacement, and don't they need,
[Antonio Perez-Calero Yzquierdo] 15:36:55
Yeah.
[Enrico Fermi Institute] 15:37:07
like, WLCG agreement at that point, that they actually consider this to be an equivalent replacement?
[Antonio Perez-Calero Yzquierdo] 15:37:13
Yeah, in principle, the idea is that for CPU-intensive workloads, estimated at about 50% of the CPU requirement,
[Antonio Perez-Calero Yzquierdo] 15:37:24
or request, 50% would be provided by, yeah, by BSC.
[Antonio Perez-Calero Yzquierdo] 15:37:30
And then we still would have some CPU for data processing,
[Antonio Perez-Calero Yzquierdo] 15:37:34
let's say, for the usual Tier-1. That's kind of the idea.
[Antonio Perez-Calero Yzquierdo] 15:37:40
But in order to do that, like I said, we are being kind of forced to transform this into a universal resource, which it very much is not.
[simonecampana] 15:37:52
Yeah, just to comment: several people, including myself, talked to the funding agency and to PIC, and also to BSC.
[Enrico Fermi Institute] 15:38:02
Yeah.
[simonecampana] 15:38:05
But it seems to be a triangle that doesn't really understand each other.
[simonecampana] 15:38:11
So I think what Antonio is saying is correct. They're trying to push this down
[simonecampana] 15:38:17
the throat, and of course we are trying to push back. Now,
[simonecampana] 15:38:21
of course, a funding agency is not obliged to pledge, right?
[simonecampana] 15:38:27
I mean, the funding agency just says: okay, this is the money we have, and, you know, if you want extra, you can use this.
[Ian Fisk] 15:38:35
Okay. I guess, Simone, on that point, my point was sort of like:
[Ian Fisk] 15:38:40
do we want to? When we were writing the MoU, we set relatively strict criteria about what services needed to be run, and what the expectations were in terms of quality of service and availability, but also in the deployment of protocols. And this occurs to me as a place where,
[Ian Fisk] 15:38:58
like, the WLCG could decide that one of the protocols that's necessary to be considered a site is this.
[Ian Fisk] 15:39:05
And it's not guaranteed to work.
[Ian Fisk] 15:39:07
But I think that, in some sense, existing without it almost guarantees that it will not change.
[simonecampana] 15:39:14
Yeah, I mean, it would be useful if, I would say, the PIC management would make a formal request on this to WLCG, because in reality what PIC has done is a lot of diligent work to try to overcome the limitations.
[Ian Fisk] 15:39:32
Right.
[simonecampana] 15:39:34
It would be good if this would be the other way around, and at some point they would say: we cannot do so.
[simonecampana] 15:39:41
We cannot offer Tier-1 services with this kind of facility. And then we would have a discussion with the funding agency on that basis. At the moment those discussions have led to not too much, to be honest. I don't know if Antonio has more detail; that's what I understand also from
[Enrico Fermi Institute] 15:39:54
Okay.
[simonecampana] 15:39:58
PIC.
[Enrico Fermi Institute] 15:39:59
Could we try to move on and maybe move that offline? Because, yeah, it's interesting, but it's also internal to WLCG and the Spanish funding agency.
[Antonio Perez-Calero Yzquierdo] 15:40:10
Yeah, thank you.
[Enrico Fermi Institute] 15:40:11
So that's not relevant to the... I think we have one more presentation, and then we still need to have the cost discussion.
[Enrico Fermi Institute] 15:40:16
Yeah, so we're running a little late. Yes, let's move on. Taylor,
[Enrico Fermi Institute] 15:40:24
do you have slides for us?
[Taylor Childers] 15:40:28
Yeah, I have a few slides
[Enrico Fermi Institute] 15:40:30
Okay, great.
[Taylor Childers] 15:40:36
Hey!
[Enrico Fermi Institute] 15:40:40
Right.
[Taylor Childers] 15:40:41
So, hi. This is a disclaimer to make sure I don't do anything silly, but, you know, the point is this is my own outlook
[Taylor Childers] 15:40:52
on the future. I'm not presenting any inside information. I don't even know what's coming after
[Taylor Childers] 15:41:00
Aurora. There are people at Argonne that do, but not me.
[Enrico Fermi Institute] 15:41:06
But Aurora is still coming, right? That's...
[Taylor Childers] 15:41:08
Yeah, if there's anything that is real, it's that Aurora is still coming. That's been the case for far too long.
[Enrico Fermi Institute] 15:41:21
still coming.
[Taylor Childers] 15:41:22
Yeah, it's still coming. Okay, so I went back and updated this plot from a long time ago to provide a quick update on where things are in the US.
[Taylor Childers] 15:41:38
We've talked about this at length at this point, but I think it's also useful to look at it in the context of the LHC
[Taylor Childers] 15:41:48
runs, right? By the time the High-Lumi LHC turns on, we're gonna be dealing with machines
[Taylor Childers] 15:41:54
where we don't even know what they look like yet, and a lot can happen between now and then that can affect how those machines look.
[Taylor Childers] 15:42:05
So we now have Frontier deployed, so the US
[Taylor Childers] 15:42:11
has its first exascale machine. We'll have Aurora coming online by the end of the year. And the next-generation machines, you know, like I said, we don't know what those are. Everything that we have is
[Taylor Childers] 15:42:26
sort of Intel, NVIDIA, AMD. I would expect these to follow similar trends, mainly because of the politics of it
[Taylor Childers] 15:42:37
all, right? I mean, we're spending US taxpayer money, and they want that to go to US corporations.
[Taylor Childers] 15:42:44
So I expect those will stay static. But of course the variation in combinations, you can already see, is quite large, so those can still change.
[Taylor Childers] 15:43:00
Just to quickly put that in perspective, I included the recent Japanese machine that they deployed and the European machines that have been announced. I'm pretty sure this was confirmed in Andrea's slides, or the slides on EuroHPC,
[Enrico Fermi Institute] 15:43:20
Yeah.
[Taylor Childers] 15:43:25
that there's gonna be one more exascale
[Taylor Childers] 15:43:29
machine announced. So we know JUPITER is coming, and the plan was always to have two EuroHPC exascale machines before '25.
[Taylor Childers] 15:43:42
I include China on here. In principle they already have three exascale machines, and intend to have 10 by '24, '25.
[Taylor Childers] 15:43:52
That's their goal. There's no reason they can't do that.
[Taylor Childers] 15:43:55
They seem to be willing to burn as much coal as possible to keep these machines at the exascale.
[Taylor Childers] 15:44:02
As I understand it, this one is just a giant...
[Taylor Childers] 15:44:05
Oh, no, the Tianhe-3 is a giant upgrade of the 2.
[Taylor Childers] 15:44:09
So it's just a bunch of CPUs, and there is no energy budget there.
[Taylor Childers] 15:44:13
So it's, you know, a hot machine. The interesting thing about all of these is that they have various architectures that are very different.
[Taylor Childers] 15:44:28
Europe has gone heavy into ARM and eventually will go into RISC-V
[Taylor Childers] 15:44:33
as an open-source accelerator format.
[Taylor Childers] 15:44:37
They're also, you know, into sovereign
[Taylor Childers] 15:44:42
technologies. Everybody wants, you know, their stuff built here.
[Taylor Childers] 15:44:47
So the Japanese are using Fujitsu chips.
[Taylor Childers] 15:44:51
The Europeans are trying to design their own. I wouldn't be surprised if the ARM and the RISC-V stuff changes in the EU, because, you know, Intel has already announced they're gonna open some foundries in Europe, and I think that's gonna help their image in the
[Taylor Childers] 15:45:11
area, so we'll see.
[Taylor Childers] 15:45:16
So just a quick look at the distribution of architectures.
[Taylor Childers] 15:45:22
So I took the Top500. I made the cutoff that a machine had to be bigger than 10 petaflops.
[Taylor Childers] 15:45:28
That leaves me at about 50 machines, and I just plotted the flops by architecture. Frontier really heavily dominates this now, so you can see, you know, the AMD CPUs and GPUs from an exascale machine compared to everyone else.
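A minimal sketch of the aggregation described here, for readers who want to reproduce this kind of plot: apply the 10-petaflop cutoff, then sum flops per vendor. The system entries below are made-up placeholders, not Top500 data, and crediting a machine's full flops to both its CPU vendor and its accelerator vendor is an assumption about how the plot was built, not something stated in the talk.

```python
from collections import defaultdict

# Hypothetical placeholder entries: (system, peak_pflops, cpu_vendor, accel_vendor).
# These numbers are invented for illustration and are NOT Top500 data.
SYSTEMS = [
    ("system-a", 1000.0, "AMD", "AMD"),
    ("system-b", 150.0, "IBM", "NVIDIA"),
    ("system-c", 40.0, "Intel", None),      # CPU-only machine
    ("system-d", 8.0, "Intel", "NVIDIA"),   # below the cutoff, dropped
]

CUTOFF_PFLOPS = 10.0


def flops_share_by_vendor(systems, column, cutoff=CUTOFF_PFLOPS):
    """Sum peak flops per vendor for systems above the cutoff and
    return each vendor's fraction of the total.

    column is 2 for the CPU vendor or 3 for the accelerator vendor.
    Systems with no vendor in that column are skipped.
    """
    totals = defaultdict(float)
    for entry in systems:
        pflops, vendor = entry[1], entry[column]
        if pflops <= cutoff or vendor is None:
            continue
        totals[vendor] += pflops
    grand_total = sum(totals.values())
    return {vendor: value / grand_total for vendor, value in totals.items()}


cpu_share = flops_share_by_vendor(SYSTEMS, column=2)
accel_share = flops_share_by_vendor(SYSTEMS, column=3)
```

With real input, `cpu_share` and `accel_share` would feed the two pie charts directly; a single very large machine dominates both, which is the effect described for Frontier.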
[Taylor Childers] 15:45:47
So you can see right now, you know, outside of Frontier, NVIDIA is really dominating the accelerators. There's a nice distribution of CPUs. And then I went ahead
[Taylor Childers] 15:46:03
to '26, and tried to do the same plot
[Enrico Fermi Institute] 15:46:05
Okay.
[Taylor Childers] 15:46:10
for what I think is coming. So by 2026 the US
[Taylor Childers] 15:46:16
and Europe will both have two exascale machines. Like I said, China will have up to 10.
[Taylor Childers] 15:46:20
I didn't include the Chinese in this number, largely because, I mean, I have no idea of the technicalities of what they're going to be running.
[Taylor Childers] 15:46:32
Europe has at least put out a roadmap, so their goal is to be using these ARM chips and the RISC-V accelerators.
[Taylor Childers] 15:46:41
So if I include those at sort of, you know, over an exaflop,
[Taylor Childers] 15:46:48
then you start seeing this distribution. So you see, there's ARM, AMD, Intel on the CPU side, and AMD,
[Taylor Childers] 15:47:00
Intel, and then this is essentially that RISC-V processor.
[Taylor Childers] 15:47:04
So if the Europeans decide to move to NVIDIA or Intel or AMD,
[Taylor Childers] 15:47:11
this green blob here will shift. So you can see the variation is, you know, fairly equal.
[Taylor Childers] 15:47:24
So then there's specialty hardware. So the DOE has always been strong in partnering with industry.
[Taylor Childers] 15:47:33
We really like pushing collaborations with industry. ALCF
[Taylor Childers] 15:47:40
hosts the DOE AI Testbed, and currently we have five machines that are all custom silicon designed for running machine learning jobs. And so we've been working with those developers, testing out their software and whatnot. There's definitely an interest in identifying one or
[Enrico Fermi Institute] 15:47:44
Okay.
[Taylor Childers] 15:48:04
two that, you know, scientists like best, and then moving along with maybe making those sidecars to some future supercomputer, right?
[Taylor Childers] 15:48:18
So you could imagine having, you know, a couple of racks of these specialized chips available to you, to run your AI much, much faster than a traditional GPU or CPU. The other thing I wanted to say moving forward... I'm closing my door, my kids are coming home from
[Taylor Childers] 15:48:43
school. The other thing I wanted to mention was, of course, AI for Science, in the context of ECP. So many of you will be familiar with ECP, the Exascale Computing Project.
[Enrico Fermi Institute] 15:49:01
Cool.
[Taylor Childers] 15:49:01
It was a large funded project on the ASCR side that, you know, the last number I heard is, in principle,
[Taylor Childers] 15:49:13
it funded about 1,000 FTEs across the..., and it was all geared toward preparing for exascale machines.
[Taylor Childers] 15:49:24
Now, with the landing of our two exascale systems, this project's going to be ramping down, and there's a lot of work to figure out what's going to come next.
[Taylor Childers] 15:49:39
And it really looks like AI for Science is the next big push. So there's already...
[Taylor Childers] 15:49:46
It's already been two years' worth of workshops
[Taylor Childers] 15:49:50
on the ASCR side, where we are trying to lay out the groundwork for what such a project would look like, and how it would be managed, and what its goals would be. So I expect that in the next, you know, five years this is gonna be sort of a dominating
[Taylor Childers] 15:50:13
force, just like ECP was. So just something to be aware of.
[Enrico Fermi Institute] 15:50:15
Thank you.
[Taylor Childers] 15:50:19
I think that's going to have a big impact on
[Taylor Childers] 15:50:24
how our systems look in this next round of deployments.
[Taylor Childers] 15:50:31
So, the takeaways: I would say the future of architectures at HPC facilities is quite diverse.
[Taylor Childers] 15:50:40
I expect it to remain so. There might be some custom hardware, but it will be very niche, is what I expect, for AI, and you'll just be picking up TensorFlow and PyTorch and running your software the way you would anywhere else.
[Taylor Childers] 15:50:54
I would say the software implications there are that using portable frameworks will be a benefit, and of course, the more we can complain and voice our interest in standard support through the C++ standard
[Taylor Childers] 15:51:16
to companies, I think that, you know, it's a good thing. But until everyone supports something like
[Taylor Childers] 15:51:23
std::par out of the C++ standard, you know, using these third-party libraries like Kokkos and SYCL and Alpaka is probably gonna be the best way to go for the moment. Let's see, current exascale machines
[Taylor Childers] 15:51:38
were largely decided before AI became a real focus
[Taylor Childers] 15:51:43
in DOE science, and I expect that to be a bigger driver for the next round of systems that are coming. Of course, the energy budgets and the competitive nature of these machines will probably drive them in the direction of accelerators again, but things
[Taylor Childers] 15:52:07
shift quickly. It's hard to predict. So yeah, that's where I'll leave that.
[Enrico Fermi Institute] 15:52:19
Taylor, I had a quick question. I think it's on slide 3, where you kinda made the pie charts.
[Enrico Fermi Institute] 15:52:26
Yeah, if you would try to make a single pie chart, right?
[Enrico Fermi Institute] 15:52:33
The problem with pie charts is you can't tell the relative size. How much larger are the GPU flops currently versus the CPU flops?
[Enrico Fermi Institute] 15:52:42
Is there a way to get it all into a single one?
[Taylor Childers] 15:52:48
Yeah, I mean, so any system that has accelerators is gonna be dominated by them, right? The last time I calculated that was probably for Summit, and there it was, you know, on the level of 5 to 10 times the CPU flops,
[Enrico Fermi Institute] 15:52:54
Yeah.
[Taylor Childers] 15:53:06
and it got even worse whenever I did the calculation for Frontier and Aurora.
[Taylor Childers] 15:53:13
But it's been a long time since I looked at those.
[Enrico Fermi Institute] 15:53:17
So I guess the point is, if it was drawn to scale, the GPU pie chart would be 10 times larger than the CPU one, or 5 times, 10 times, not the same size, right?
[Taylor Childers] 15:53:23
That's right.
[Taylor Childers] 15:53:29
For sure, for sure.
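A small sketch of what drawing the two pies to scale would involve, assuming the pie area (not the radius) should be proportional to total flops; the function name and the choice of area scaling are illustrative assumptions, and the 5x to 10x range is the one quoted in the discussion. Because area grows with the square of the radius, the GPU pie's radius grows only with the square root of the flops ratio.

```python
import math


def radius_for_area_ratio(reference_radius, flops_ratio):
    """Radius of a pie whose AREA is flops_ratio times the reference pie's area."""
    if flops_ratio <= 0:
        raise ValueError("flops_ratio must be positive")
    return reference_radius * math.sqrt(flops_ratio)


cpu_radius = 1.0
for ratio in (5, 10):  # the 5x to 10x range quoted in the discussion
    gpu_radius = radius_for_area_ratio(cpu_radius, ratio)
    print(f"GPU/CPU flops = {ratio}x -> GPU pie radius = {gpu_radius:.2f} x CPU pie radius")
```

Scaling the radius linearly with the ratio instead would exaggerate the difference, since the eye tends to compare areas.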
[Enrico Fermi Institute] 15:53:32
And what is "other" for the GPUs here?
[Taylor Childers] 15:53:36
So
[Enrico Fermi Institute] 15:53:38
Is that the
[Taylor Childers] 15:53:40
Yeah, so in this case that would be the Fujitsu.
[Enrico Fermi Institute] 15:53:47
Okay.
[Taylor Childers] 15:53:50
I can look back in my spreadsheet, too.
[Enrico Fermi Institute] 15:54:01
That probably also explains why ARM is a larger piece than Kelvin
[Taylor Childers] 15:54:06
Oh, no, sorry. In this one, the other is the Tianhe-2, which is on the Top500, and it's one of these. It's this one.
[Enrico Fermi Institute] 15:54:13
Okay.
[Enrico Fermi Institute] 15:54:21
Okay, if you told me that was a 386 chip, I'd also believe you.
[Enrico Fermi Institute] 15:54:26
So okay, Taylor, on performance portability: does that mean that if they decide on a system design, the LCF,
[Enrico Fermi Institute] 15:54:39
or whoever funds it, makes sure that it's supported by the performance
[Enrico Fermi Institute] 15:54:44
portability libraries?
[Taylor Childers] 15:54:46
Well, and I think that's the benefit of something like
[Taylor Childers] 15:54:51
Kokkos, which is really, it's third-party support, right?
[Taylor Childers] 15:54:55
So Kokkos came out of the ECP project, and I imagine will continue to be supported.
[Taylor Childers] 15:55:06
And since it's third party, they can just come in and write a new plugin for whatever, you know, new hardware comes along, and so as long as you use it, you gain the benefit from that. When we first were working with Intel and SYCL, I was
[Taylor Childers] 15:55:31
very skeptical of SYCL. I mean, in general
[Taylor Childers] 15:55:35
I'm skeptical of, especially, telling scientists to invest their time in a solution that's being pushed by one of the manufacturers.
[Taylor Childers] 15:55:47
Right? I mean, CUDA is a mess. As, you know, someone who came up in the sciences writing code,
[Taylor Childers] 15:55:55
I would never wish anyone to write code in CUDA, and so I approached SYCL in the same respect.
[Taylor Childers] 15:56:07
But, I mean, it's getting good performance and it allows you to write your code once, and so far we've been able to run it on all three systems.
[Taylor Childers] 15:56:17
We run it, at least with MadGraph. We have a SYCL implementation, and it runs on the AMD, the Intel, and the NVIDIA GPUs without any problem, and does very well, and Kokkos is the same. And like you said, the nice thing about those two is that you write
[Enrico Fermi Institute] 15:56:32
See.
[Taylor Childers] 15:56:37
your code once. But with CUDA, the CUDA implementation of MadGraph right now is riddled with precompiler ifdefs everywhere, because if you're not on a CUDA device you need to run the C++, and, you know, it just becomes really hard to
[Taylor Childers] 15:56:56
maintain for someone who's not a dedicated software developer.
[Enrico Fermi Institute] 15:57:07
We still have to cover the HPC cost. I would like to at least attempt to go through the slide; we'll have to see.
[Enrico Fermi Institute] 15:57:14
Okay, if it's too long, eventually we might have to cut it off and move it to tomorrow or something.
[Enrico Fermi Institute] 15:57:18
Yeah, we could start a little earlier tomorrow. I don't know how people feel about that.
[Enrico Fermi Institute] 15:57:24
Yeah, thanks, Taylor. Appreciate it. So let's try to go to the HPC
[Enrico Fermi Institute] 15:57:32
cost, and then we're right up on the... yeah.
[Enrico Fermi Institute] 15:57:35
There was a question in the charge, as I remember:
[Enrico Fermi Institute] 15:57:41
the total cost of operating HPC resources, and they especially included the outlook, too. And the thing is, the cost of operating it, I mean, this is really about acquiring and operating, because nominally they're free. I mean,
[Enrico Fermi Institute] 15:58:02
eventually there's some indirect effect, because you get them from the same funding agencies.
[Enrico Fermi Institute] 15:58:07
that fund your purchased hardware, but that's indirect, and that's also outside the scope of this workshop.
[Enrico Fermi Institute] 15:58:14
So you basically have to prepare your proposals once per year, usually. ACCESS allows supplementals.
[Enrico Fermi Institute] 15:58:22
There's work on multi-year proposals, and maybe that will mean that you still have to do a proposal each year,
[Enrico Fermi Institute] 15:58:30
but you don't have to do much work for it.
[Enrico Fermi Institute] 15:58:31
You just sign it off with your request; you already know what you're getting. But this is a work in progress. And then there's technical integration and commissioning work, and that's mostly one-time,
[Enrico Fermi Institute] 15:58:43
that is, you integrate a facility once, you find a way to make it work, and then you just have to maintain what you came up with. And this needs to be redone every few years,
[Enrico Fermi Institute] 15:58:56
because these HPCs have a limited lifetime.
[Enrico Fermi Institute] 15:58:58
Basically, five years is around the maximum before you can expect it to be replaced with a different machine.
[Enrico Fermi Institute] 15:59:03
What we experienced so far is that there are synergy effects
[Enrico Fermi Institute] 15:59:07
if you stay within the same facility, because usually they have similar restrictions, similar ways to do things. So when switching from one cluster to another in the same facility, when they do a replacement, you don't have to throw out everything and start from scratch; you just make adjustments to what
[Enrico Fermi Institute] 15:59:27
you did before. There's an open question on the LCF
[Enrico Fermi Institute] 15:59:34
integration, at least on the CMS side. I mean, you have your Harvester; for us, at least, the long-term operational overheads
[Enrico Fermi Institute] 15:59:42
there are a little harder to estimate. They're likely also larger there, because the provisioning integration looks like it's gonna be a bit more complex, and won't tie neatly into what we're doing anyways
[Enrico Fermi Institute] 15:59:57
for the grid sites. So you need to do something special. Then support:
[Enrico Fermi Institute] 16:00:02
I mean, that's one of the things that came up in the context of pledging.
[Enrico Fermi Institute] 16:00:07
It's something where you need to be able to send a ticket.
[Enrico Fermi Institute] 16:00:11
So there's operations support, because you don't have a CMS site contact.
[Enrico Fermi Institute] 16:00:16
Now, admittedly, at the grid sites, the Tier-2s,
[Enrico Fermi Institute] 16:00:19
the site contact is also usually someone the operations program pays for.
[Steven Timm] 16:00:23
hmm.
[Enrico Fermi Institute] 16:00:24
It's not that this is necessarily a cost that's unique to the HPC.
[Steven Timm] 16:00:30
Well, I mean, if there's a problem at NERSC now, the HEPCloud team gets a GGUS ticket, and we respond to it.
[Enrico Fermi Institute] 16:00:31
Yes.
[Enrico Fermi Institute] 16:00:38
Yes, exactly. That's what I mean. I mean, at a Tier-2,
[Steven Timm] 16:00:40
So here cause here is the same contract
[Enrico Fermi Institute] 16:00:42
if there's a problem at Wisconsin, you file a ticket, and the person that we pay from the operations program
[Steven Timm] 16:00:51
Okay.
[Enrico Fermi Institute] 16:00:53
at Wisconsin responds to it. So in that sense, it's not that different from paying for site operations. And again, the other great example is the NorduGrid folks, whose experiment-specific ops
[Steven Timm] 16:00:55
Good.
[Enrico Fermi Institute] 16:01:09
teams, or even WLCG-specific ops teams, can be fairly far separated from
[Steven Timm] 16:01:16
Yes.
[Enrico Fermi Institute] 16:01:19
the people who are actually operating the cluster.
[Steven Timm] 16:01:20
Yeah.
[Enrico Fermi Institute] 16:01:21
Yeah, and then I want to break that operations support into two components.
[Enrico Fermi Institute] 16:01:27
Because one is just normal workflow support, just dealing with: oh, you have a lot of failures,
[Enrico Fermi Institute] 16:01:33
can you look into it? And you look at logs, or whatever. Usually it's debugging of job failures. And to first order this scales with the amount of resources, because the more work you pass through, the more problems you can expect. And there's overlap here with the normal
[Enrico Fermi Institute] 16:01:50
operations support by the experiment, so the first line of defense that basically monitors overall workflows, computing operations.
[Enrico Fermi Institute] 16:01:59
And that goes up to the point where you open the GGUS ticket against the site. And then the second mode is:
[Enrico Fermi Institute] 16:02:07
once said GGUS ticket is open against the site,
[Enrico Fermi Institute] 16:02:09
whoever responds will have to have specialized HPC integration knowledge, because some of these failure modes can be specific to how that HPC
[Enrico Fermi Institute] 16:02:20
was integrated, and that implies that there's a long-term need to keep commissioning expertise around.
[Enrico Fermi Institute] 16:02:28
But we probably need to do that anyways, because of the HPC
[Enrico Fermi Institute] 16:02:35
cluster turnover. So the commissioning efforts need to be redone.
[Enrico Fermi Institute] 16:02:40
So that's kind of, if you're talking many HPCs, there's constantly a need to work on this stuff. We've been doing this long enough.
[Enrico Fermi Institute] 16:02:48
Can't you estimate what those labor costs are?
[Enrico Fermi Institute] 16:02:52
In FTEs? Yeah, you can. You can try to come up with one.
[Steven Timm] 16:02:54
Right.
[Enrico Fermi Institute] 16:02:55
I mean, we've done it for multiple years. I can for the user facilities;
[Steven Timm] 16:02:57
Oh!
[Enrico Fermi Institute] 16:03:00
you definitely can do it. For the LCF, as I said, I'm unsure, because I don't know what the long-term stable operations
[Enrico Fermi Institute] 16:03:08
mode will look like. At the moment that still needs to be done.
[Enrico Fermi Institute] 16:03:11
But for the user facilities, definitely, we can come up with an estimate. And then with the LCFs...
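The cost components listed in this discussion (yearly proposals, mostly one-time commissioning that is redone when a machine is replaced, ongoing maintenance, and workflow support that scales to first order with resources used) can be put into a toy model. This is only a sketch under stated assumptions: the class name, the linear form, and every number in the example are hypothetical placeholders, not estimates from the workshop.

```python
from dataclasses import dataclass


@dataclass
class HpcCostModel:
    """Toy yearly-effort model; every number fed in is a placeholder."""
    proposal_fte: float            # yearly allocation proposal preparation
    commissioning_fte: float       # one-time integration effort per machine
    machine_lifetime_years: float  # how often commissioning is redone
    maintenance_fte: float         # keeping an existing integration working
    support_fte_per_unit: float    # workflow support per unit of resources used

    def yearly_fte(self, n_machines: int, resource_units: float) -> float:
        """Average FTE per year across all integrated machines."""
        # Commissioning is amortized over the machine lifetime.
        amortized = self.commissioning_fte / self.machine_lifetime_years
        per_machine = self.proposal_fte + amortized + self.maintenance_fte
        # To first order, operations support scales with resources used.
        return n_machines * per_machine + self.support_fte_per_unit * resource_units


# Hypothetical inputs, chosen only to exercise the model.
model = HpcCostModel(
    proposal_fte=0.05,
    commissioning_fte=1.0,
    machine_lifetime_years=5.0,
    maintenance_fte=0.1,
    support_fte_per_unit=0.02,
)
estimate = model.yearly_fte(n_machines=3, resource_units=10.0)
```

The model deliberately ignores two points raised in the discussion: synergy when a replacement machine stays within the same facility (which would lower `commissioning_fte`), and the fact that maintenance effort arrives in bursts rather than evenly over the year.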
[Steven Timm] 16:03:14
Right. I mean
[Enrico Fermi Institute] 16:03:17
Can you write down why you can't get what you need from that, so that in the document you can make an estimate?
[Enrico Fermi Institute] 16:03:25
But you can qualify it. No, no, what I mean is, you can do it for the user facilities, right?
[Steven Timm] 16:03:27
Right.
[Enrico Fermi Institute] 16:03:30
And that's because they have these properties. In the LCFs you can't.
[Steven Timm] 16:03:34
Right.
[Enrico Fermi Institute] 16:03:35
You can put some error bars, but they're missing these properties.
[Enrico Fermi Institute] 16:03:39
If they had those properties that the user facilities have, would that allow you to give a more precise estimate for the LCFs?
[Enrico Fermi Institute] 16:03:45
You see what I'm saying? Obviously, something about the way the user facilities are set up...
[Steven Timm] 16:03:45
Okay, Well.
[Enrico Fermi Institute] 16:03:51
Steve? Go on, Steve.
[Steven Timm] 16:03:52
Yes. You have two components to the maintenance,
[Steven Timm] 16:03:59
where one of them is when the remote site changes their API,
[Steven Timm] 16:04:03
the way you have to log in. Okay, that's been done four times in six years
[Steven Timm] 16:04:07
now, breaking the APIs that we used, and having to change it.
[Steven Timm] 16:04:13
So that's one end of things. So I mean, this is fairly straightforward.
[Steven Timm] 16:04:19
I mean, at this moment you should expect that it would change. The other part of it is stuff that's upstream of us, for instance, token authorization.
[Steven Timm] 16:04:31
I mean, there we still haven't quite got done. All the various hacks that are done to get into the HPC
[Steven Timm] 16:04:40
sites don't necessarily translate as well as a regular site would; there needs to be more work done
[Steven Timm] 16:04:43
there. So if you have a big change in the upstream, OSG or things like that, that can really throw us for a loop.
[Enrico Fermi Institute] 16:04:53
That's what I meant by technical integration and commissioning work,
[Enrico Fermi Institute] 16:04:56
that there's a long-term maintenance effort.
[Steven Timm] 16:04:56
Alright.
[Steven Timm] 16:04:59
Well, it
[Enrico Fermi Institute] 16:04:59
These are always a bit special, so there's always the chance that something will break, and you have to do
[Steven Timm] 16:05:05
Right. You need somebody that can read and understand factory logs, basically.
[Steven Timm] 16:05:08
And in, and he called me got it
[Enrico Fermi Institute] 16:05:11
And the maintenance isn't necessarily evenly distributed.
[Enrico Fermi Institute] 16:05:15
For instance, it's not a so-much-per-month type thing, right? Sometimes for 6 months nothing happens, and then something goes boom.
[Steven Timm] 16:05:17
Right, right. And then you have to allow for the fact that some of these people don't answer their tickets very well at all.
[Steven Timm] 16:05:28
Yeah, in particular, just good. So maybe he's got a thing to people who listen to them.
[Steven Timm] 16:05:38
We'd like to hear it, because we have very little luck
[Enrico Fermi Institute] 16:05:44
And
[Steven Timm] 16:05:45
okay.
[Enrico Fermi Institute] 16:05:47
Okay. But I think we can make an attempt here to estimate this in terms
[Steven Timm] 16:05:52
Yeah, yeah, yeah, sure.
[Enrico Fermi Institute] 16:05:52
of FTEs, we can probably on a S existing has to be said. We have for good size 52 sites, which is also an index.
[Steven Timm] 16:05:59
Well.
[Enrico Fermi Institute] 16:06:02
So science to to
[Steven Timm] 16:06:02
So the amount of effort there to keep up with the maintenance is well known.
[Enrico Fermi Institute] 16:06:07
Yeah, but I also
[Steven Timm] 16:06:09
And so basically 30% of me, basically, that's what it is.
[Steven Timm] 16:06:14
So
[Enrico Fermi Institute] 16:06:15
So, but all FTEs are not created equal, so somehow you have to capture the skill set of those FTEs.
[Steven Timm] 16:06:18
Good.
[Enrico Fermi Institute] 16:06:22
Yeah, then that's harder to do in terms of a high-level document. I know it's harder, but you have to.
[Enrico Fermi Institute] 16:06:35
Good. But, well, yeah, ATLAS and CMS have solved the same problem
[Enrico Fermi Institute] 16:06:40
in 2 slightly different ways, and that requires 2 different skill sets, a political one and a technical one.
[Enrico Fermi Institute] 16:06:47
The one that I'd really like, that we should hammer on, is the difference of these costs
[Enrico Fermi Institute] 16:06:54
for an LCF-type facility versus a user facility. So I think you could probably communicate that more effectively.
[Enrico Fermi Institute] 16:07:03
That's probably That might be the order. Sure.
[Steven Timm] 16:07:04
Oh, I mean, there's ongoing dev work, and there's gonna be ongoing dev work on the LCF side, too.
[Steven Timm] 16:07:11
I mean, significant dev work there.
[Enrico Fermi Institute] 16:07:12
Yeah, but that's a one-time cost.
[Enrico Fermi Institute] 16:07:14
We also will want to try to estimate what the long-term operational support is, and there will be large error bars.
[Enrico Fermi Institute] 16:07:22
But we can make an attempt.
[Steven Timm] 16:07:23
Right.
[Enrico Fermi Institute] 16:07:26
And then, apart from the cost and effort that are directly associated with HPC operations,
[Enrico Fermi Institute] 16:07:36
there's a secondary component that's a bit more indirect and harder to estimate, but it will come into play at some point as we scale up HPC operations: we need hardware and services at grid sites to support the data and job flows at the
[Enrico Fermi Institute] 16:07:51
HPCs.
[Enrico Fermi Institute] 16:07:53
One thing you didn't put on as a cost is the payload cost.
[Enrico Fermi Institute] 16:07:58
So, in other words, as we just heard, in Europe and in the US
[Enrico Fermi Institute] 16:08:03
the next-generation big machines will have more and more accelerators; that is where the FLOPs are found, you know.
[Enrico Fermi Institute] 16:08:12
They will have a CPU-only partition. The cost for porting things to GPU was specifically excluded as out of scope. I understand, but we have to explain that that is something that will probably have to be handled, because, you know, obviously CMS, because GPUs are in your
[Enrico Fermi Institute] 16:08:32
trigger, you guys are a little bit farther ahead than ATLAS.
[Enrico Fermi Institute] 16:08:36
I mean, we will put that in as a component, but we're not going to put any effort level on it. Because you can't, because you don't know. But it's not the goal for this document; it's not supposed to be its goal.
[Enrico Fermi Institute] 16:08:49
Another strategic thing you could talk about here is what's common versus
[Enrico Fermi Institute] 16:08:59
what's experiment-specific. Yeah, keeping it at the leading-order type things.
[Enrico Fermi Institute] 16:09:08
If we go through the presentations and find overlaps, then call them out, because again, when it comes to cost, you need to think about how the agencies view it.
[Enrico Fermi Institute] 16:09:23
They do like to see common activities.
[Enrico Fermi Institute] 16:09:30
You can't make things that are common, not common.
[Enrico Fermi Institute] 16:09:33
So it would be death to say everything is the same, because I think sure if I rescue for a baby, I'm happy.
[Enrico Fermi Institute] 16:09:43
But trying to call that out can be a strategic way to help people look at the cost.
[Enrico Fermi Institute] 16:09:54
Steve, I see your hand's still up. Did you have another comment?
[Steven Timm] 16:10:00
no, I was no.
[Enrico Fermi Institute] 16:10:02
Alright on that last bullet. Oh, no!
[Enrico Fermi Institute] 16:10:10
This is us.
[Enrico Fermi Institute] 16:10:23
When you get to the report writing, I mean, if I had a better way to state that: it doesn't have to be, I mean, so what I would highlight is,
[Enrico Fermi Institute] 16:10:33
this doesn't have to be at grid sites. For example, if you think of the Spin work at NERSC, that might be perfectly fine.
[Enrico Fermi Institute] 16:10:43
So, I mean, is it not really about edge services? No,
[Enrico Fermi Institute] 16:10:51
because, for instance, you wouldn't need Globus and all that if the WLCG data grid could talk to it as an equal. If NERSC could be an equal member of the WLCG data grid, you would not have to do any sort of translation, jump-through-hoops step. If
[Enrico Fermi Institute] 16:11:11
ALCF had a gatekeeper or something equivalent that we could,
[Enrico Fermi Institute] 16:11:18
that we could both submit jobs to with tokens, that would be,
[Enrico Fermi Institute] 16:11:22
That's an example of an edge service that would be common development.
[Enrico Fermi Institute] 16:11:25
That would make the cost easier for that. But I include that more in the technical integration and long-term maintenance. Stuff that has to happen, that we'll need at the HPC sites, I would include there.
[Enrico Fermi Institute] 16:11:41
My problem with the last bullet is that it says you have services at grid sites as a solution.
[Enrico Fermi Institute] 16:11:51
You could turn that bullet maybe into additional operated services for HPC,
[Enrico Fermi Institute] 16:11:57
as opposed to, say, services at grid sites. But that is a dollar cost.
[Enrico Fermi Institute] 16:12:03
That money was spent. Yeah, and it was to work around the deficiency.
[Enrico Fermi Institute] 16:12:09
But the point is, does that not fall under the prior two bullets?
[Enrico Fermi Institute] 16:12:19
What I thought to include here, and we'll have a discussion on that later, because there's the integration hypotheticals and the impact on the rest of the collaboration.
[Enrico Fermi Institute] 16:12:30
It's more about, like, assume Fermilab is a big storage site for CMS in the US,
[Enrico Fermi Institute] 16:12:36
and consider the difference between putting 50,000 extra CPUs
[Enrico Fermi Institute] 16:12:41
at Fermilab, and having those 50,000 CPUs somewhere else.
[Enrico Fermi Institute] 16:12:46
It's network and kind of external data serving and transport links.
[Enrico Fermi Institute] 16:12:51
Okay. So it's especially, but in terms of capital equipment, I mean.
[Enrico Fermi Institute] 16:12:56
So what we could do is to say service operations, or services support cost, and call that out separately from operations support.
[Enrico Fermi Institute] 16:13:05
But if you're really thinking of the hardware, call hardware out separately. That's a very different color of money.
[Enrico Fermi Institute] 16:13:15
That's hardware. The last bullet is hardware.
[Enrico Fermi Institute] 16:13:18
I can tell you how much we spend. Yeah, so as I wrote the Rbt: Yeah, in that case, don't mix it in with services.
[Enrico Fermi Institute] 16:13:27
Have a hardware-only bullet, right?
[Enrico Fermi Institute] 16:13:32
And that hardware potentially needs to be renewed, right?
[Enrico Fermi Institute] 16:13:36
Of course. What I mean is, if we continue to need it, we have to continue to fund it. So I would just split that last one into at least two.
[Enrico Fermi Institute] 16:13:47
Yes, okay, I think that was the last item we had for today.
[Enrico Fermi Institute] 16:13:53
That is, are you thinking, in the end, for the strategic report
[Enrico Fermi Institute] 16:13:57
in December or whatever, to have a dollar range here? Is that the goal, or just pointing out the considerations that need to be made?
[Enrico Fermi Institute] 16:14:10
We were specifically discouraged from comparing HPC and
[Enrico Fermi Institute] 16:14:16
cloud costs to grid costs, and it was a little bit of a back and forth, but at the end that's the decision that was made.
[Enrico Fermi Institute] 16:14:24
So we should just try to come up with some costs on their own,
[Enrico Fermi Institute] 16:14:29
so without comparison. But I mean, are you saying, for a user facility like NERSC,
[Enrico Fermi Institute] 16:14:34
We need between x
[Enrico Fermi Institute] 16:14:40
We'll put an FTE number, different depending on where, as an Unc.
[Enrico Fermi Institute] 16:14:51
Cost cost. Can you also phone? And should be also folded?
[Enrico Fermi Institute] 16:14:54
X amount of CPU cores, efficient running means
[Enrico Fermi Institute] 16:14:59
Y amount of disk at the site, so that if we can't get the Y
[Enrico Fermi Institute] 16:15:05
amount of disk through the grant procedure, then that would actually be a cost, because you would have to do the condo model of buying storage. Well, that's why I like just having a separate hardware bullet. Where the hardware sits, I mean, obviously you care where the
[Enrico Fermi Institute] 16:15:26
hardware sits, but there will be a capital outlay.
[Enrico Fermi Institute] 16:15:32
Is this last part tied to the discussion this morning about data delivery, and having a significant cache or endpoint at the HPCs,
[Enrico Fermi Institute] 16:15:45
If you wanted to do it that way. I don't mean to.
[Enrico Fermi Institute] 16:15:49
I guess the idea is that that would come through an allocation if it's part of the facility, right?
[Enrico Fermi Institute] 16:15:53
So maybe that is a department If they give us a storage, then it comes from the Yeah.
[Enrico Fermi Institute] 16:15:58
But if we get very little storage, that puts a lot of pressure on the network and then on storage somewhere else, because you have to be very
[Enrico Fermi Institute] 16:16:06
You can think of it this way: I get 500 terabytes with my allocation, but I need a petabyte. How do I make up the gap? I either make it up through streaming the stuff in
[Enrico Fermi Institute] 16:16:18
and out, or I buy storage at the site, and so on.
[Enrico Fermi Institute] 16:16:28
Depending on how much time you have to fill this out, you can talk about the different types of costs and different example scenarios, because the problem with
[Enrico Fermi Institute] 16:16:36
these things about caches, or, you know, looking at a site, is that it's a trade-off. You can say,
[Enrico Fermi Institute] 16:16:42
Well, if I put 200 TB on the site, I might say the extermination years.
[Enrico Fermi Institute] 16:16:48
But then obviously some sites... I can find a quote for what it takes per terabyte on Expanse, as an example, but usually about 8 or 5 storage.
[Enrico Fermi Institute] 16:17:04
Well, that's the problem: what is "usually"? I can tell you, I'm doing this.
[Enrico Fermi Institute] 16:17:10
I can tell you that NERSC allows you to pay: you give them the money and they do it, and so do some of the smaller sites.
[Enrico Fermi Institute] 16:17:17
That's in fact, how the Atp group got into the Lcrcs.
[Enrico Fermi Institute] 16:17:22
They have a condo model: you write the check, they'll deploy it.
[Enrico Fermi Institute] 16:17:27
That's the thing, though, because is storage, like, a multi-year commitment?
[Enrico Fermi Institute] 16:17:31
Or do you pay for it? Do you rent it?
[Enrico Fermi Institute] 16:17:35
You pay for it; it basically depends. It's usually, you know, for a quantum of time, which may be multi-year, but at the end of the quantum, bye-bye. Write up a couple of scenarios to show the fact that some of these are trade-offs, and to communicate
[Enrico Fermi Institute] 16:17:57
But we prefer that it comes through the allocation process, because in the application we lay out a use case, and we say we can use this much CPU, but then we need that much storage to actually use it effectively.
[Enrico Fermi Institute] 16:18:10
So this would be a
[Enrico Fermi Institute] 16:18:13
It would not be a preferred choice that we have to buy storage. It gets into how much time you want to spend drawing up scenarios.
[Enrico Fermi Institute] 16:18:21
There's a lot to write here. The HPC facilities typically haven't had in their architecture something sitting there that looks like a cache that's facing the wide-area network.
[Enrico Fermi Institute] 16:18:33
In other words, they have different ways of provisioning storage within.
[Enrico Fermi Institute] 16:18:40
But usually, like we saw from NERSC, there's a big scratch disk, there's other storage there; I mean, there's the home file system, the big scratch area. There didn't seem to be... is there something that's sitting on the edge of
[Enrico Fermi Institute] 16:18:53
the network that could actually serve as a cache?
[Enrico Fermi Institute] 16:18:59
I mean, the file systems are connected through data transfer nodes to the outside, and that's a separate connection.
[Enrico Fermi Institute] 16:19:05
It's not internal, but that's usually high speed, so you can get in and out of there.
[Enrico Fermi Institute] 16:19:11
It's not visible on the onset, though. What's your budget
[Enrico Fermi Institute] 16:19:17
I think it's what Doug was saying
[Enrico Fermi Institute] 16:19:21
You 5 more switches. I remember the cash, so we'll say yes
[Enrico Fermi Institute] 16:19:30
Okay, any other comments from the Zoom?
[Enrico Fermi Institute] 16:19:38
I think we're done. Thanks, everybody for slogging it out.
[Enrico Fermi Institute] 16:19:43
Yeah, so I think that's good. I mean, we'll come back to HPC in some of the later discussions.
[Enrico Fermi Institute] 16:19:50
But the focus tomorrow morning will be on, yes, we start with the cloud focus area tomorrow, and then in the afternoon we'll have networks, integration hypotheticals, and R&D.
[Enrico Fermi Institute] 16:20:04
Okay, good. Thanks, everybody. We'll talk to you tomorrow.
[Antonio Perez-Calero Yzquierdo] 16:20:09
Thank you.