HPC Cost and discussions

[Enrico Fermi Institute] 15:57:32
Cost, and then we're right up on the Yeah.

[Enrico Fermi Institute] 15:57:35
There was a question on the charge or remember to share.

[Enrico Fermi Institute] 15:57:41
At this time the total cost of operating HPC resources, and they especially included the outlook to each. And the thing is, the cost of operating it, I mean, this is really about acquiring and operating, because nominally they're free. I mean,

[Enrico Fermi Institute] 15:58:02
eventually there's some indirect effect, because you get them from the same funding agencies.

[Enrico Fermi Institute] 15:58:07
that fund your purchased hardware, but that's indirect, and that's also outside the scope of this workshop.

[Enrico Fermi Institute] 15:58:14
So you basically have to prepare your proposals once per year; usually ACCESS allows supplementals.

[Enrico Fermi Institute] 15:58:22
There's work on multi-year proposals, and maybe that will mean that you still have to do a proposal each year.

[Enrico Fermi Institute] 15:58:30
But you don't have to do much work for it.

[Enrico Fermi Institute] 15:58:31
You just sign it off with your request. You already know what you're getting. But this is a work in progress. And then there's technical integration, commissioning work, and that's mostly one-time.

[Enrico Fermi Institute] 15:58:43
That is, you integrate a facility once, you find a way to make it work, and then you just have to maintain what you came up with. And this needs to be redone every few years.

[Enrico Fermi Institute] 15:58:56
Because these HPCs have a limited lifetime.

[Enrico Fermi Institute] 15:58:58
Basically, 5 years is around the maximum you can expect before they replace it with a different machine.

[Enrico Fermi Institute] 15:59:03
What we have experienced so far is that there are synergy effects.

[Enrico Fermi Institute] 15:59:07
If you stay within the same facility, because usually they have similar restrictions and similar ways to do things. So when switching from one cluster to another in the same facility, when they do a replacement, you don't have to throw out everything and start from scratch; you just make adjustments to what

[Enrico Fermi Institute] 15:59:27
you probably did before. There's an open question on the LCF

[Enrico Fermi Institute] 15:59:34
integration, at least on the CMS side. I mean, you have your Harvester; for us, at least, the long-term operational overheads

[Enrico Fermi Institute] 15:59:42
there are a little harder to estimate. They're likely also larger there, because the provisioning integration looks like it's going to be a bit more complex, and not tie neatly into what we're doing anyway

[Enrico Fermi Institute] 15:59:57
for the grid sites, so you need to do something special. Then support.

[Enrico Fermi Institute] 16:00:02
I mean, that's one of the things that came up in the context of pledging.

[Enrico Fermi Institute] 16:00:07
It's something where you need to be able to send a ticket.

[Enrico Fermi Institute] 16:00:11
So there's operations support, because you don't have, at least on the CMS side, a site contact.

[Enrico Fermi Institute] 16:00:16
Now, admittedly, at the grid sites, the T2s,

[Enrico Fermi Institute] 16:00:19
the site contact is also someone the operations program usually pays for.

[Steven Timm] 16:00:23
hmm.

[Enrico Fermi Institute] 16:00:24
So it's not that this is necessarily a cost that's unique to the HPC.

[Steven Timm] 16:00:29
well, that

[Steven Timm] 16:00:30
Well, I mean, if there's a problem now, the team gets a GGUS ticket, and we respond to it.

[Enrico Fermi Institute] 16:00:31
Yes.

[Enrico Fermi Institute] 16:00:38
Yes, exactly. That's what I mean. I mean the T2.

[Steven Timm] 16:00:40
So here, because here it's the same contact.

[Enrico Fermi Institute] 16:00:42
If there's a problem at Wisconsin, you file a ticket, and the person that we pay money to, or fund from the operations program,

[Steven Timm] 16:00:51
Okay.

[Enrico Fermi Institute] 16:00:53
at Wisconsin responds to it. So in that sense, it's not that different from supporting site operations. And again, the other great example is that the door to grid folks use experiment-specific ops

[Steven Timm] 16:00:55
Good.

[Enrico Fermi Institute] 16:01:09
teams, or even WLCG-specific ops teams, which can be fairly far separated from the, okay,

[Steven Timm] 16:01:16
Yes.

[Enrico Fermi Institute] 16:01:19
the people who are actually operating the cluster.

[Steven Timm] 16:01:20
Yeah.

[Enrico Fermi Institute] 16:01:21
Yeah, yeah, and then I want to break that operations support into 2 components.

[Enrico Fermi Institute] 16:01:27
Because one is just normal workflow support, just dealing with: oh, you have a lot of failures,

[Enrico Fermi Institute] 16:01:33
can you look into it? And you look at log files, or whatever; usually debugging of job failures. And to first order this scales with the amount of resources, because the more work you pass through, the more problems you can expect. And there's overlap here with the normal

[Enrico Fermi Institute] 16:01:50
operations support by the experiment, so that's the first line of defense that basically monitors overall workflow and computing operations.

[Enrico Fermi Institute] 16:01:59
And then it goes up to the point where you open the GGUS ticket against the site. And then the second one is:

[Enrico Fermi Institute] 16:02:07
then, once said GGUS ticket is open, it's going to the site.

[Enrico Fermi Institute] 16:02:09
Whoever responds will have to have specialized HPC integration knowledge, because some of these failure modes can be specific to how that HPC

[Enrico Fermi Institute] 16:02:20
was integrated, and that implies that there's a long-term need to keep commissioning expertise around.

[Enrico Fermi Institute] 16:02:28
But we probably need to do that anyway, because of the HPC

[Enrico Fermi Institute] 16:02:35
cluster turnover. So the commissioning efforts need to be redone.

[Enrico Fermi Institute] 16:02:40
So that's kind of, if you're talking many HPCs, there's constantly a need to work on this stuff. We've been doing this long enough.
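
The effort structure described so far (a mostly one-time commissioning effort that is redone when a machine is replaced, plus ongoing operations support) can be sketched roughly as below. This is only an illustration: the class, the function, the site names, and every FTE number are hypothetical placeholders, not estimates from the workshop. The 5-year lifetime is the rough maximum mentioned earlier in the discussion.

```python
from dataclasses import dataclass


@dataclass
class HpcSite:
    name: str                     # hypothetical label, not a real facility
    integration_fte_years: float  # one-time commissioning effort (placeholder)
    lifetime_years: float         # machine lifetime before replacement
    ops_fte_per_year: float       # ongoing operations support (placeholder)


def annual_fte(site: HpcSite) -> float:
    """Average yearly effort: commissioning spread over the lifetime, plus ops."""
    return site.integration_fte_years / site.lifetime_years + site.ops_fte_per_year


# Placeholder inputs; a second machine at the same facility is assumed to
# need less commissioning because earlier work can be adjusted and reused.
sites = [
    HpcSite("hpc-a", integration_fte_years=1.0, lifetime_years=5.0, ops_fte_per_year=0.3),
    HpcSite("hpc-b", integration_fte_years=0.5, lifetime_years=5.0, ops_fte_per_year=0.3),
]
total = sum(annual_fte(s) for s in sites)
print(f"total average effort: {total:.2f} FTE per year")
```

Spreading the commissioning effort evenly over the machine lifetime is a simplification; as noted in the discussion, the real maintenance load tends to arrive in bursts.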

[Enrico Fermi Institute] 16:02:48
Can't you estimate what those labor costs are?

[Enrico Fermi Institute] 16:02:52
In FTEs. Yeah, you can. You can try to come up.

[Steven Timm] 16:02:54
Right.

[Enrico Fermi Institute] 16:02:55
I mean, we've done it for multiple years, I can for the user facilities.

[Steven Timm] 16:02:57
Oh!

[Enrico Fermi Institute] 16:03:00
You definitely can do it. The LCFs, as I said, I'm unsure, because I don't know what the long-term stable operations

[Enrico Fermi Institute] 16:03:08
mode will look like. At the moment that still needs to be done.

[Enrico Fermi Institute] 16:03:11
But for the user facilities, definitely, we can come up with an estimate. And then with the LCFs,

[Steven Timm] 16:03:14
Right. I mean

[Enrico Fermi Institute] 16:03:17
can you write down why you can't get what you need from that, so that in the document you can make an estimate?

[Enrico Fermi Institute] 16:03:25
But you can qualify it. No, no, what I mean is, you can do it for the user facilities, right?

[Steven Timm] 16:03:27
Right.

[Enrico Fermi Institute] 16:03:30
And then, because they have these properties. In the LCFs you can't.

[Steven Timm] 16:03:34
Right.

[Enrico Fermi Institute] 16:03:35
You can put some error bars, but they're missing these properties.

[Enrico Fermi Institute] 16:03:39
If they had those properties that the user facilities have, would that allow you to give a more precise estimate for the LCFs?

[Enrico Fermi Institute] 16:03:45
You see what I'm saying? Obviously, there's something about the way the user facilities are set up.

[Steven Timm] 16:03:45
Okay, Well.

[Enrico Fermi Institute] 16:03:51
Steve? Go on, Steve.

[Steven Timm] 16:03:52
Yes. You have 2 components to the maintenance.

[Steven Timm] 16:03:59
One of them is when the remote site changes their API,

[Steven Timm] 16:04:03
the way you have to log in. That's been done 4 times in 6 years

[Steven Timm] 16:04:07
now, breaking the API that we used, and having to change it.

[Steven Timm] 16:04:13
So that's one end of things. So I mean, this is fairly straightforward.

[Steven Timm] 16:04:19
I mean, at the moment you should expect that it would change. The other part of it is stuff that's upstream of us, for instance, I'm talking about tokenization.

[Steven Timm] 16:04:31
I mean, there we still haven't quite got done. All the various hacks that are done to get into the HPC

[Steven Timm] 16:04:40
sites don't necessarily translate as well as a regular site would; they need more work to be done

[Steven Timm] 16:04:43
there. So if you have a big change in the upstream, OSG, or things like that, that can really throw us for a loop.

[Enrico Fermi Institute] 16:04:53
That's what I meant by technical integration commissioning work.

[Enrico Fermi Institute] 16:04:56
That there's a long-term maintenance effort.

[Steven Timm] 16:04:56
Alright.

[Steven Timm] 16:04:59
Well, it

[Enrico Fermi Institute] 16:04:59
It's always a bit special, so there's always the chance that something will break, and you have to do

[Steven Timm] 16:05:05
Right. You need somebody that can read and understand factory logs, basically.

[Steven Timm] 16:05:08
And in, and he called me got it

[Enrico Fermi Institute] 16:05:11
And the maintenance isn't necessarily evenly distributed.

[Enrico Fermi Institute] 16:05:15
For instance, it's not a so-much type thing, right? Sometimes for 6 months nothing happens, and then something goes boom.

[Steven Timm] 16:05:17
Right, right. And then you have to allow for the fact that some of these people don't answer their tickets very well at all.

[Steven Timm] 16:05:28
Yeah, in particular, just good. So maybe he's got a thing to people who listen to them.

[Steven Timm] 16:05:38
We'd like to hear it, because we have very little luck

[Enrico Fermi Institute] 16:05:44
And

[Steven Timm] 16:05:45
okay.

[Enrico Fermi Institute] 16:05:47
Okay. But I think we can make an attempt here to estimate this in terms

[Steven Timm] 16:05:52
Yeah, yeah, yeah, sure.

[Enrico Fermi Institute] 16:05:52
of FTEs, we can probably on a S existing has to be said. We have for grid sites 52 sites, which is also an index.

[Steven Timm] 16:05:59
Well.

[Enrico Fermi Institute] 16:06:02
So science to to

[Steven Timm] 16:06:02
So the amount of effort there that goes into maintenance is well known.

[Enrico Fermi Institute] 16:06:07
Yeah, but I also

[Steven Timm] 16:06:09
And so basically 30% of me, basically, that's what it is.

[Steven Timm] 16:06:14
So

[Enrico Fermi Institute] 16:06:15
So, but all FTEs are not created equal, so somehow you have to capture the skill set of that FTE.

[Steven Timm] 16:06:18
Good.

[Enrico Fermi Institute] 16:06:22
Yeah, then that's harder to do in terms of a high-level document. I know it's harder, but you have to.

[Enrico Fermi Institute] 16:06:35
Good. But, well, yeah, ATLAS and CMS have solved the same problem

[Enrico Fermi Institute] 16:06:40
in 2 slightly different ways, and that requires 2 different skill sets, a political and ethical.

[Enrico Fermi Institute] 16:06:47
The one that I'd really like, that we should hammer on, is the difference of these costs

[Enrico Fermi Institute] 16:06:54
for an LCF-type facility versus a user facility. So I think you could probably communicate that more effectively.

[Enrico Fermi Institute] 16:07:03
That's probably That might be the order. Sure.

[Steven Timm] 16:07:04
Oh, I mean, there's ongoing dev work, and there's going to be ongoing dev work on the LCF side, too.

[Steven Timm] 16:07:11
I mean significant dev work there.

[Enrico Fermi Institute] 16:07:12
Yeah, but that's a one-time cost.

[Enrico Fermi Institute] 16:07:14
We also will want to try to estimate what the long-term operational support is, and there will be large error bars.

[Enrico Fermi Institute] 16:07:22
But we can make an attempt.

[Steven Timm] 16:07:23
Right.

[Enrico Fermi Institute] 16:07:26
And then there's another. Apart from the costs and efforts that are directly associated with HPC operations,

[Enrico Fermi Institute] 16:07:36
there's a secondary component that's a bit more indirect and harder to estimate, but it will come into play at some point as we scale up HPC operations: we need hardware and services at grid sites to support these data and job flows at the

[Enrico Fermi Institute] 16:07:51
HPCs.

[Enrico Fermi Institute] 16:07:53
Because you didn't put on as a cost, but the payload cost.

[Enrico Fermi Institute] 16:07:58
So, in other words, as we just heard, in Europe and in the US,

[Enrico Fermi Institute] 16:08:03
the next-generation big machines will have more and more accelerators; that's where the FLOPs are, you know.

[Enrico Fermi Institute] 16:08:12
It? Do you? Molly will have CPU-only partition. The R&D cost for porting things to GPUs was specifically excluded, out of scope. I understand, but we have to explain that that is something that will probably have to be handled, because, you know, obviously CMS, because GPUs are in your

[Enrico Fermi Institute] 16:08:32
trigger, you guys are a little bit farther ahead than ATLAS.

[Enrico Fermi Institute] 16:08:36
I mean, we will put that in as a component, but we're not going to put any effort level on it, because you can't, because you don't know. But it's not in scope for this document; it's not supposed to be in scope.

[Enrico Fermi Institute] 16:08:49
Another strategic thing you could talk about here is what's common versus

[Enrico Fermi Institute] 16:08:59
what's experiment-specific. Yeah, yeah, keeping it at the leading-order type things.

[Enrico Fermi Institute] 16:09:08
If we go through the presentations and find overlaps, then call them out, because again, when it comes to cost, you need to think about how the agencies view it.

[Enrico Fermi Institute] 16:09:23
They do like to see common activities.

[Enrico Fermi Institute] 16:09:30
you can't make things that are common, not common.

[Enrico Fermi Institute] 16:09:33
So you you It would be death to say everything is the same, because I think sure if I rescue for a baby, I'm happy.

[Enrico Fermi Institute] 16:09:43
But trying to to call that out can be a strategic way to help people look at the cost

[Enrico Fermi Institute] 16:09:54
Steve, I see your hand's still up. Did you have another comment?

[Steven Timm] 16:10:00
no, I was no.

[Enrico Fermi Institute] 16:10:02
Alright on that last bullet. Oh, no!

[Enrico Fermi Institute] 16:10:10
This is us.

[Enrico Fermi Institute] 16:10:23
When you get to the report writing, I mean, if I had a better way to state that. It doesn't have to be, I mean, so what I would highlight:

[Enrico Fermi Institute] 16:10:33
does this have to be at grid sites? For example, if you think of the Spin work at NERSC, that might be perfectly fine.

[Enrico Fermi Institute] 16:10:43
So, I mean, is it not really about edge services? No,

[Enrico Fermi Institute] 16:10:51
because, for instance, you wouldn't need Globus and all that if the WLCG data grid could talk as an equal; if NERSC could be an equal member of the WLCG data grid, you would not have to do any sort of translation, jump-through-hoops step. If

[Enrico Fermi Institute] 16:11:11
ALCF had a gatekeeper or something equivalent that we could,

[Enrico Fermi Institute] 16:11:18
that we could both submit jobs to with tokens, that would be,

[Enrico Fermi Institute] 16:11:22
That's an example of an edge service that would be common development.

[Enrico Fermi Institute] 16:11:25
That would make the cost easier for that. But I include that more in the technical integration and long-term maintenance. Stuff that's happening at the HPC sites I would include there.

[Enrico Fermi Institute] 16:11:41
That's my proposal. The last bullet is, say that you have services at grid sites as a solution.

[Enrico Fermi Institute] 16:11:51
You could turn that bullet maybe into additional operated services for HPC,

[Enrico Fermi Institute] 16:11:57
as opposed to, say, services at grid sites. But that is a dollar cost.

[Enrico Fermi Institute] 16:12:03
That money was spent. Yeah, Yeah, and it was to work around the deficiency.

[Enrico Fermi Institute] 16:12:09
But the point is, does that not fall under the prior two bullets?

[Enrico Fermi Institute] 16:12:19
What I thought to include here, we'll have a discussion on that later, because there are some integration hypotheticals and impact on the rest of the collaboration.

[Enrico Fermi Institute] 16:12:30
It's more about, like, assume you have Fermilab as a big storage site for CMS in the US,

[Enrico Fermi Institute] 16:12:36
and consider the difference between putting 50,000 extra CPUs

[Enrico Fermi Institute] 16:12:41
at Fermilab, and having those 50,000 CPUs somewhere else.

[Enrico Fermi Institute] 16:12:46
There's network and external data serving and transport links.

[Enrico Fermi Institute] 16:12:51
Okay. So it's especially, but in terms of capital equipment, I mean.

[Enrico Fermi Institute] 16:12:56
So what we could do is say service operations, or services support cost, and call that out separately from operations support.

[Enrico Fermi Institute] 16:13:05
But if you're really thinking of the hardware, call hardware out separately. That's a very different color of money.

[Enrico Fermi Institute] 16:13:15
That's hardware. The last bullet is hardware.

[Enrico Fermi Institute] 16:13:18
I can tell you how much we spend. Yeah, so as I wrote the Rbt. Yeah, in that case, don't mix it in with services.

[Enrico Fermi Institute] 16:13:27
Have a hardware-only bullet, right?

[Enrico Fermi Institute] 16:13:32
And that hardware potentially needs to be renewed, right?

[Enrico Fermi Institute] 16:13:36
Of course, if we need it, we need it. What I mean is, if we continue to need it, we have to continue to fund it. So I would just split that last one into at least 2 call-outs.

[Enrico Fermi Institute] 16:13:47
Yes, okay, yes, I think that was the last item we had for today.

[Enrico Fermi Institute] 16:13:53
Are you thinking, at the end, for the strategic report

[Enrico Fermi Institute] 16:13:57
in December or whatever, to have a dollar range here? Is that the end goal, or just pointing out the considerations that need to be made?

[Enrico Fermi Institute] 16:14:10
We are, specifically, we were discouraged from comparing HPC

[Enrico Fermi Institute] 16:14:16
and cloud costs to grid costs. It was a little bit of a back and forth, but in the end that's the decision that was made.

[Enrico Fermi Institute] 16:14:24
So we should just try to come up with some costs on their own,

[Enrico Fermi Institute] 16:14:29
so without comparison. But I mean, are you saying, for a user facility like NERSC,

[Enrico Fermi Institute] 16:14:34
We need between x

[Enrico Fermi Institute] 16:14:40
We'll put an FTE number, different depending on where, with an uncertainty.

[Enrico Fermi Institute] 16:14:51
Cost. Can you also fold in, and should it also be folded in:

[Enrico Fermi Institute] 16:14:54
X amount of CPU cores; efficient running means

[Enrico Fermi Institute] 16:14:59
Y amount of disk at the site, so that if we can't get the Y

[Enrico Fermi Institute] 16:15:05
amount of disk through the grant procedure, then that would actually be a cost, because you would have to do the condo model of buying storage. Well, that's why I like just having a separate hardware bullet. Where the hardware sits, I mean, obviously you care where the

[Enrico Fermi Institute] 16:15:26
hardware sits, but there will be a capital outlay.

[Enrico Fermi Institute] 16:15:32
This last part goes to the discussion this morning about data delivery, and having a significant cache or data endpoint at the HPCs,

[Enrico Fermi Institute] 16:15:45
If you wanted to do it that way. I don't mean to.

[Enrico Fermi Institute] 16:15:49
I guess the idea is that that would come through an allocation if it's part of the facility, right?

[Enrico Fermi Institute] 16:15:53
So maybe that is dependent. If they give us the storage, then it comes from the, yeah.

[Enrico Fermi Institute] 16:15:58
But if we get very little storage, that puts a lot of pressure on the network and then on storage somewhere else, because you have to be very.

[Enrico Fermi Institute] 16:16:06
You can think of it this way: I get 500 terabytes with my allocation, but I need a petabyte. And how do I make up the gap? I either make it up through streaming stuff in

[Enrico Fermi Institute] 16:16:18
and out, or I buy storage at the site, and so on.
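
The trade-off just described can be written down as simple arithmetic. In the sketch below, only the 500 TB allocation and the 1 PB need come from the discussion; the function names, the price per terabyte, and the number of passes over the data per year are hypothetical placeholders, not quotes from any facility.

```python
def storage_gap_tb(needed_tb: float, allocated_tb: float) -> float:
    """Storage shortfall in TB (zero if the allocation covers the need)."""
    return max(needed_tb - allocated_tb, 0.0)


def options_for_gap(gap_tb: float, price_per_tb: float, passes_per_year: int) -> dict:
    """Compare the two ways of covering a shortfall named in the discussion.

    price_per_tb and passes_per_year are hypothetical inputs:
    - buying storage at the site is a dollar cost, gap_tb * price_per_tb
    - streaming instead moves the missing data over the network on each pass
    """
    return {
        "buy_storage_cost": gap_tb * price_per_tb,
        "streamed_tb_per_year": gap_tb * passes_per_year,
    }


# 1 PB taken as 1000 TB; the price and pass count are placeholders.
gap = storage_gap_tb(needed_tb=1000.0, allocated_tb=500.0)
print(gap)
print(options_for_gap(gap, price_per_tb=100.0, passes_per_year=4))
```

The point of keeping the two options side by side is that one shows up as a hardware purchase and the other as pressure on the network and on storage elsewhere, which is why the discussion suggests laying out a few example scenarios rather than a single number.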

[Enrico Fermi Institute] 16:16:28
Depending on how much time you have to fill this out, you can talk about the different types of costs and different example scenarios, because the problem with

[Enrico Fermi Institute] 16:16:36
these things about caches, or, you know, looking at a site, is that it's a trade-off. You can say,

[Enrico Fermi Institute] 16:16:42
Well, if I put 200 TB on the site, I might say the extermination years.

[Enrico Fermi Institute] 16:16:48
but then obviously some sites. No, or I I I can, find a quote for what it takes which termites, own that expanse as an example, but usually usually about 8 or 5 storage.

[Enrico Fermi Institute] 16:17:04
Then Well, that's the problem. What what is Usually I can tell you I'm doing this.

[Enrico Fermi Institute] 16:17:10
I can tell you that NERSC allows you to pay: you give them the money and they do it. And some of the smaller sites,

[Enrico Fermi Institute] 16:17:17
That's, in fact, how the Atp group got into the LCRC.

[Enrico Fermi Institute] 16:17:22
They have a condo model: you write a check, and they'll deploy it too.

[Enrico Fermi Institute] 16:17:27
Is that it, though? Because storage is like a multi-year commitment.

[Enrico Fermi Institute] 16:17:31
Or do you pay for it, do you rent it?

[Enrico Fermi Institute] 16:17:35
You pay for it, it basically depends. It's usually, you know, for a quantum of time, which may be multi-year, but at the end of the quantum, bye-bye. Lay out a couple of scenarios to convey the fact that some of these are trade-offs, and to communicate.

[Enrico Fermi Institute] 16:17:57
But we prefer that it comes through the allocation process, because in the application we lay out a use case, and we say we can use this much CPU, but then we need that much storage to actually use it effectively.

[Enrico Fermi Institute] 16:18:10
So this would be a

[Enrico Fermi Institute] 16:18:13
It would not be a preferred choice that we have to buy storage. It gets into how much time you want to spend drawing up scenarios.

[Enrico Fermi Institute] 16:18:21
There's a lot to write here. The HPC facilities typically haven't had in their architecture something sitting there that looks like a cache that's facing the wide area network.

[Enrico Fermi Institute] 16:18:33
In other words, they have different ways of provisioning storage within.

[Enrico Fermi Institute] 16:18:40
But usually, like we saw from NERSC, there's a big scratch disk, there's other storage there, I mean, there's the home file system, the big scratch area. There didn't seem to be, is there something that's sitting on the edge of

[Enrico Fermi Institute] 16:18:53
the network that could actually serve as a cache

[Enrico Fermi Institute] 16:18:59
I mean, the file systems are connected via data transfer nodes to the outside, and that's a separate connection.

[Enrico Fermi Institute] 16:19:05
It's not internal, but that's usually high speed, so you can get in and out of there.

[Enrico Fermi Institute] 16:19:11
It's not visible on the onset, though. What's your budget

[Enrico Fermi Institute] 16:19:17
I think it's what Doug was saying

[Enrico Fermi Institute] 16:19:21
You 5 more switches. I remember the cash, so we'll say yes

[Enrico Fermi Institute] 16:19:30
Okay, any other comments from the Zoom

[Enrico Fermi Institute] 16:19:38
I think we're done. Thanks, everybody for slogging it out.

[Enrico Fermi Institute] 16:19:43
Yeah, so I think that's good. I mean, we'll come back to HPC in some of the later discussions.

[Enrico Fermi Institute] 16:19:50
But the focus tomorrow morning will be on, yes, we start with the cloud focus area tomorrow, and then in the afternoon we'll have networks, integration hypotheticals, and R&D.

[Enrico Fermi Institute] 16:20:04
Okay, good. Thanks, everybody. We'll talk to you tomorrow.

[Antonio Perez-Calero Yzquierdo] 16:20:09
Thank you.