- Landscape of workflows: US HPC


[Enrico Fermi Institute] 11:09:56
Everybody can hear just fine. Yeah, because I'm sitting here I'm just getting picked up by the mic on the ceiling. Okay?

[David Mason] 11:09:59
we can hear.

[Enrico Fermi Institute] 11:10:01
Great, thank you. Okay. So if you go to the next slide: the first area we want to cover is what we're doing in terms of workflows on HPC and cloud, and to do that, maybe at the very first we look at what resources we're actually looking at

[Enrico Fermi Institute] 11:10:19
here right now. If you look at what's available for US HPC, we have broadly two types of facilities, and they have different user experiences in terms of how you approach them and how you can use them. There are the leadership-class facilities funded by DOE: Argonne,

[Enrico Fermi Institute] 11:10:38
Oak Ridge, and so on.

[Enrico Fermi Institute] 11:10:41
They're very restricted. They focus on accelerators to get the most flops for a given power budget.

[Enrico Fermi Institute] 11:10:47
They don't care too much about making it easy for the user.

[Enrico Fermi Institute] 11:10:50
You are expected to adjust your workflows to be able to run there, and they target large-scale workloads.

[Enrico Fermi Institute] 11:10:57
This is the kind of stuff that you can do nowhere else.

[Enrico Fermi Institute] 11:10:59
And then there are the user facilities: NERSC, TACC, the XSEDE sites, which are usually a mix.

[Enrico Fermi Institute] 11:11:11
Some of them look straight-up like HPC compute clusters in how they're built.

[Enrico Fermi Institute] 11:11:17
Some of them have interconnects; there might be a mix of GPUs and CPUs, mostly still CPUs, and they take all comers.

[Enrico Fermi Institute] 11:11:25
Basically you can get an allocation. You can get going.

[Enrico Fermi Institute] 11:11:28
They work with you to try to make it easy, so you can get on the facility and get your work done. Next slide. And at any time, if you want to make a comment or ask a question, please just do; we're not supposed to go

[Enrico Fermi Institute] 11:11:42
through the big presentation. So it's a discussion. Yep.

[Enrico Fermi Institute] 11:11:48
So, with that in mind, what are we currently running there?


[Enrico Fermi Institute] 11:11:53
So this is the right now. If you see green, that's a straightforward copy from the charge: a question we were asked, which we answer here with what we're doing right now. So for CMS,

[Enrico Fermi Institute] 11:12:05
what we're doing is basically anything that starts with a generator step and has no input except for pileup

[Enrico Fermi Institute] 11:12:12
we currently assign to a lot of US HPC sites. You don't have to do anything special; the workflow gets injected automatically.

[Enrico Fermi Institute] 11:12:19
You can run there, and that was the majority of Run 2 Monte Carlo

[Enrico Fermi Institute] 11:12:24
workflows, and the Run 3 Monte Carlo work to a large extent as well. For ATLAS it's primarily simulation.

[Enrico Fermi Institute] 11:12:33
Usually workflows are specifically assigned to HPC sites: you select a bunch, I guess you pick,

[Enrico Fermi Institute] 11:12:40
and say, this is a good fit, and then you assign it there, and it runs.

[Enrico Fermi Institute] 11:12:42
They also have the goal to expand on that. The limiting factors

[Enrico Fermi Institute] 11:12:53
in what workflows you can target at an HPC are usually based on machine characteristics. So, CPU architecture: at certain HPCs, I mean, Intel is easy to use; when it gets beyond that it's currently still a little bit difficult. Whether you have a GPU accelerator. How much memory

[Enrico Fermi Institute] 11:13:10
you have per core; memory per core, with KNL kind of a dying breed, is disappearing a bit as a concern.

[Enrico Fermi Institute] 11:13:16
So that's usually okay now. Then, network connectivity, and it's not just to and from the node, like the LCF firewalls; it's also for the facility as a whole.

[Enrico Fermi Institute] 11:13:27
Sometimes HPC facilities have restrictions or firewall limits; when you scale up, you hit scaling limits where you basically overload the pipe, because they

[Enrico Fermi Institute] 11:13:38
aren't used to such data-intensive workflows.
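
The matchmaking just described could be sketched roughly as follows. This is a hypothetical illustration: the field names and numbers are assumptions for the sketch, not any real CMS or ATLAS production schema.

```python
# Hypothetical sketch of workflow-to-HPC matchmaking based on machine
# characteristics (CPU architecture, GPU, memory per core, WAN capacity).
# All field names and numbers are illustrative assumptions.

def compatible(workflow: dict, site: dict) -> bool:
    """Return True if the site satisfies all of the workflow's requirements."""
    if workflow["arch"] not in site["validated_archs"]:
        return False            # platform must be built and physics-validated
    if workflow["mem_per_core_gb"] > site["mem_per_core_gb"]:
        return False            # e.g. KNL-era nodes were memory-poor
    if workflow["needs_gpu"] and not site["has_gpu"]:
        return False
    if workflow["wan_gbps_needed"] > site["wan_gbps"]:
        return False            # facility-wide external connectivity ceiling
    return True

# Illustrative descriptions (made-up numbers):
site = {"validated_archs": {"x86_64"}, "mem_per_core_gb": 3.0,
        "has_gpu": False, "wan_gbps": 100}
gen_wf = {"arch": "x86_64", "mem_per_core_gb": 2.0,
          "needs_gpu": False, "wan_gbps_needed": 20}
```

A generator-style workflow with modest requirements passes such a filter at a CPU-only site; flipping any one requirement (GPU, unvalidated architecture) rules the site out.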

[Enrico Fermi Institute] 11:13:41
A quick question back to when we were talking about CPU architectures and floating-point operations.

[Enrico Fermi Institute] 11:13:47
Yeah, what in particular is making that hard from your perspective? Is it basically something like ARM? Going to ARM is not harder;

[Enrico Fermi Institute] 11:13:57
it's just a matter of extra work to validate the platform.

[Enrico Fermi Institute] 11:14:00
Okay, so it's really about numerical outcomes and making sure that things agree. Yeah, it's basically a one-time investment of being able to support the platform. Is that true on all of them, though? Because that's not true for... Yeah, OLCF is a bit of

[Enrico Fermi Institute] 11:14:18
a special case; it also has POWER. CMS just finished the POWER validation.

[Enrico Fermi Institute] 11:14:23
Okay. So the effective requirement, then, is: for a given CPU architecture,

[Enrico Fermi Institute] 11:14:32
the upstream code has to be validated. Well, firstly, you have to build your code;

[Enrico Fermi Institute] 11:14:38
it's got to be buildable, and then you need to run whatever physics validation: you produce some samples,

[Enrico Fermi Institute] 11:14:43
and then the physics group, whoever in the global collaboration, needs to go in and say this is actually okay.

[Enrico Fermi Institute] 11:14:50
So, therefore, there's a dependency on something external to you.

[Enrico Fermi Institute] 11:15:02
It requires labor from outside of the US, because the US

[Enrico Fermi Institute] 11:15:07
can't just say this platform is validated; the experiment as a whole has to say that. So, coming back to the "why": you couldn't do pileup during digitization, because you had to read extra data remotely? You can do it.

[Enrico Fermi Institute] 11:15:24
And that's the sweet spot, basically.

[Enrico Fermi Institute] 11:15:28
We currently don't run anything that needs primary input, but pileup is supported, because pileup is so unevenly distributed because of its size that even for normal production on some Tier-2 sites we read it remotely. So that's a use case we support anyway; for

[Enrico Fermi Institute] 11:15:48
the HPCs we just expanded it. So it's not a limitation?
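
The scaling concern raised next is simple arithmetic; a back-of-envelope sketch, with illustrative numbers rather than measurements, shows why streaming pileup to every core eventually collides with a facility's external link.

```python
# Back-of-envelope estimate (illustrative numbers, not measurements) of
# the aggregate WAN bandwidth needed when every core reads pileup
# remotely: aggregate demand is per-core rate times core count.

def aggregate_gbps(n_cores: int, mbps_per_core: float) -> float:
    """Total external bandwidth in Gb/s if every core streams remotely."""
    return n_cores * mbps_per_core / 1000.0

# 50,000 cores each pulling an assumed ~2 Mb/s of pileup would need
# ~100 Gb/s, on the order of an entire facility external link.
demand = aggregate_gbps(50_000, 2.0)
```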

[Enrico Fermi Institute] 11:15:53
No... yeah, I mean, eventually, as you scale up, the network connectivity comes in.

[Enrico Fermi Institute] 11:15:59
We have to look at that. For instance, at Frontera we're hitting scaling limits because of remote pileup reads.

[Enrico Fermi Institute] 11:16:06
I thought at Frontera there was a limit on the amount of remote access you could do from a node.

[Enrico Fermi Institute] 11:16:13
Yeah, so we actually hit the external connectivity limit of the facility.

[Enrico Fermi Institute] 11:16:19
And as I recall, at Frontera they mostly consider their Ethernet to be like a control plane.

[Enrico Fermi Institute] 11:16:26
Each node in the rack is connected at one gig, and each rack is connected

[Enrico Fermi Institute] 11:16:31
at 10 gig, I think, something like that, to the core.

[Enrico Fermi Institute] 11:16:36
So in that case you probably weren't doing a lot of pileup at Frontera?

[Enrico Fermi Institute] 11:16:38
We were reading pileup. Okay, so you aren't hitting... I mean, but you are running;

[Enrico Fermi Institute] 11:16:42
you're accessing your pileup data sets by Ethernet, though.

[Enrico Fermi Institute] 11:16:48
Yeah. So you're still hitting the overall capacity of the facility uplink,

[Enrico Fermi Institute] 11:16:54
then, yeah, like 100 gig or something? Well, in the beginning we actually hit the scaling limitations early on, trying to get going.

[Enrico Fermi Institute] 11:17:03
And then they limited us. But it's fine, I mean, the limit is not restrictive;

[Enrico Fermi Institute] 11:17:10
the limit is still high enough that we don't have a problem using up the allocation overall.

[Enrico Fermi Institute] 11:17:14
We just couldn't do what we tried to do, which is these 100K-core runs,

[Enrico Fermi Institute] 11:17:20
because at that point the traffic was too high. Yeah.

[Enrico Fermi Institute] 11:17:27
Oh yeah, I was at network connectivity. So we discussed the potential facility limits.

[Enrico Fermi Institute] 11:17:35
Then another limitation can be storage. If you use shared storage for input and output data, you have to integrate it into the data management solution, because you basically have to pre-place data you want to process and then stage out the output data later. That

[Enrico Fermi Institute] 11:17:51
decouples it from the job execution, similar to your own storage. But also, another

[Enrico Fermi Institute] 11:18:01
consideration is whether job scratch is local or shared.

[Enrico Fermi Institute] 11:18:04
For instance, the LCFs usually have only shared storage;

[Enrico Fermi Institute] 11:18:08
they don't give you any local storage. Most of the others... Is that the case at Frontera?

[Enrico Fermi Institute] 11:18:16
They give you local scratch, and that is another area where you can run into scaling limitations.

[Enrico Fermi Institute] 11:18:22
And looking a bit ahead: so this is what we're doing now.

[Enrico Fermi Institute] 11:18:26
If you look ahead to the HL-LHC era, assuming the resource mix shifts and we get more HPC

[Enrico Fermi Institute] 11:18:36
resources: can we still afford to restrict the workflows

[Enrico Fermi Institute] 11:18:42
we run there? Or is that basically restricting ourselves in terms of what we can do operationally?

[Enrico Fermi Institute] 11:18:53
Right now we do what's easiest, and that just came out of starting this up.

[Enrico Fermi Institute] 11:18:59
And of course you start with what's easy, just to get something to run.

[Enrico Fermi Institute] 11:19:03
But as you become experienced with it, and as the amount of resources goes up, that might not be enough to keep scaling up and take advantage of opportunities.

[Enrico Fermi Institute] 11:19:15
Now, a question from Shigeki.

[Shigeki] 11:19:18
Just out of curiosity: this is sort of the state of trying to get to work at the HPC

[Shigeki] 11:19:26
centers as they exist now. Is there any general motivation on the HPC side to sort of meet us halfway? Do they recognize that maybe this is the future, and they really need to meet the external workflows

[Enrico Fermi Institute] 11:19:44
There is, but you have to again distinguish between the user facilities and the LCFs. With the user facilities we've had very good experience, especially with NERSC, working with them.

[Shigeki] 11:19:46
halfway, in a common sort of way?

[Enrico Fermi Institute] 11:20:00
At NERSC we started around 2016; CMS had our first allocation there, and we started to target these types of workflows.

[Enrico Fermi Institute] 11:20:10
Early on, we tested remote data access and it was kilobytes per second to each node, while the claimed Cori design goal was a gigabit to the node.

[Enrico Fermi Institute] 11:20:22
So obviously something in the stack didn't work.

[Enrico Fermi Institute] 11:20:25
So we worked with them for multiple years, and now we're actually kind of there,

[Enrico Fermi Institute] 11:20:29
where we're supposed to be; everything works great. So they are very interested in working with us.

[Enrico Fermi Institute] 11:20:36
With the LCFs, I don't think we have that relationship.

[Steven Timm] 11:20:40
cool.

[Enrico Fermi Institute] 11:20:42
It would be great if we had it, but we don't.

[Steven Timm] 11:20:46
So NERSC is also already planning for NERSC-10, the machine that comes after Perlmutter. They talk to,

[Steven Timm] 11:20:54
oh, what do you call it, high-throughput people, and ask: what do we need for the next thing? So they're talking;

[Steven Timm] 11:20:59
they're starting to gather numbers, talking about what to do. So those meetings are already happening for the next round.

[Enrico Fermi Institute] 11:21:04
Yeah.

[Steven Timm] 11:21:05
But the others, as you say, are not happening at the moment.

[Enrico Fermi Institute] 11:21:08
Yeah, the feedback we got from NERSC is that they're very interested in supporting data-intensive science, and they took what they learned

[Enrico Fermi Institute] 11:21:16
on Cori running these kinds of workloads, and take that into consideration for designing the next machine. And in fact, I think he will hopefully say something about data-intensive science; but whether data-intensive means pulling stuff over the WAN, because that's a

[Enrico Fermi Institute] 11:21:34
different issue, right? I mean, yeah, it can be streaming things. You mentioned that; yes, as we scale up, we want to put more workflows on;

[Enrico Fermi Institute] 11:21:45
we have to be cognizant of the intrinsic design limitations of the clusters. I mean, running data-intensive science on a facility means either you stream everything in and stream it out, or you need local storage to cache what you process

[Enrico Fermi Institute] 11:22:01
later. That's it; these are the two options. And that's what I mentioned about storage:

[Enrico Fermi Institute] 11:22:10
it depends what each facility gives you. If you don't have a lot of attached storage, and you can get only a small storage quota compared to your CPU quota, then you don't have a lot of options in terms of how to make use

[Enrico Fermi Institute] 11:22:23
of that CPU quota. If you do get a lot of storage, you can run it like we run regular production on a grid site:

[Enrico Fermi Institute] 11:22:34
we pre-stage within our data management systems, we run, and we stage things back out. That makes things simple.
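
That "treat it like a grid site" pattern can be sketched minimally as below. The `transfer` and `run_jobs` stubs are placeholders standing in for a real data-management system (e.g. Rucio) and the site batch system; the names and signatures are assumptions for the sketch only.

```python
# Minimal sketch of grid-style operation on HPC shared storage:
# pre-stage inputs in, run jobs, stage outputs back out.
# transfer() and run_jobs() are illustrative placeholders.

def transfer(name: str, src: str, dst: str) -> str:
    """Placeholder for a managed third-party transfer from src to dst."""
    print(f"transfer {name}: {src} -> {dst}")
    return name

def run_jobs(inputs: list[str]) -> list[str]:
    """Placeholder for batch jobs reading staged inputs from shared scratch."""
    return [f + ".out" for f in inputs]

def process_block(files: list[str], scratch: str,
                  origin: str, dest: str) -> list[str]:
    staged = [transfer(f, origin, scratch) for f in files]   # pre-stage in
    outputs = run_jobs(staged)                               # compute on site
    return [transfer(o, scratch, dest) for o in outputs]     # stage back out

results = process_block(["evts_001.root"], "hpc-scratch", "tier1", "tier1")
```

The point of the pattern is that data placement is decoupled from job execution: the jobs only ever see the facility's shared storage.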

[Ian Fisk] 11:22:40
oh!

[Enrico Fermi Institute] 11:22:41
Say, do you have an idea of what the scale there would be to

[Enrico Fermi Institute] 11:22:45
make these facilities more usable? I mean, the ballpark figure we usually say is that a CMS site with a sizable amount of CPU would like to have about 500 TB of space. Roughly, I'd say, some hundreds of terabytes. Yeah, we could use probably

[Enrico Fermi Institute] 11:23:02
3 or 4 hundred, around that point; if it's less than 100 it gets difficult.

[Enrico Fermi Institute] 11:23:06
Yeah. And that's usually where we are with the experience from LCC grants, for instance; usually 150 is kind of the cutoff. That's not a lot.

[Enrico Fermi Institute] 11:23:23
Of course it would be nice if we could ask for a large storage allocation and, you know, reuse it as a storage element,

[Enrico Fermi Institute] 11:23:29
treat it like another site. But then that also gets into storage allocations over long periods of time, rather than a yearly kind of allocation.

[Enrico Fermi Institute] 11:23:43
Yeah.

[Ian Fisk] 11:23:43
Yeah, I'm wondering if somehow the concept of streaming in versus local storage is a distinction without a lot of a difference.

[Ian Fisk] 11:23:52
It's more about the time scale, right? Say they have 100 TB of data:

[Enrico Fermi Institute] 11:23:54
Yeah.

[Ian Fisk] 11:23:56
you're either streaming it directly in real time, or you're staging it in and staging it out, because 100 TB of data is not a ton of space at large scale.

[Enrico Fermi Institute] 11:24:01
Yeah, there's a small technical difference, because in one case you just keep the data in job scratch,

[Enrico Fermi Institute] 11:24:12
and in the other case you have to place it somewhere that's independent of job execution. That can make a technical difference because, for instance, I don't think NERSC counts job scratch against your scratch quota,

[Ian Fisk] 11:24:27
Okay.

[Enrico Fermi Institute] 11:24:29
while if you put something in via the DTNs, the data transfer nodes, that does count against it. And I think a lot of it is also cultural, in terms of

[Enrico Fermi Institute] 11:24:42
not commonly seeing flows that stream data. What most people expect and tool for is data coming in through DTNs to the file system,

[Enrico Fermi Institute] 11:25:00
and some time later a process runs on it. So the...

[Ian Fisk] 11:25:04
But somehow there's a balance here between the networking, the local storage, and the I/O of the jobs: you need to have a sufficient amount of

[Ian Fisk] 11:25:12
I/O to keep the resources busy. And so it's not much more complicated than that.

[Ian Fisk] 11:25:20
And it has to be a convergent system, in the sense that you're not going to be able to have the storage forever.

[Enrico Fermi Institute] 11:25:27
Okay, yeah, you're right, of course. It's a storage management problem more than it

[Enrico Fermi Institute] 11:25:35
is a storage problem.

[Ian Fisk] 11:25:39
I'm claiming it's a data delivery problem, whether it's being streamed in or whether it's being cached from a stream.

[Ian Fisk] 11:25:45
They are effectively the same problem, which is: how do I get data?

[Ian Fisk] 11:25:52
And if I look at the time scale: if something is streaming in, it's sort of a real-time problem, and it's a little bit simpler in the sense that it's a network problem; I know the I/O when there's

[Ian Fisk] 11:26:03
not a long delay. But if I expand out to the time scale of even just a couple of weeks, staging it in still requires a certain amount of networking, and staging out likewise.

[Ian Fisk] 11:26:13
How much time do I have on this particular resource?

[Enrico Fermi Institute] 11:26:17
So doesn't this depend on the scheduling modality of the HPC?

[Enrico Fermi Institute] 11:26:22
Because they tend to...

[Enrico Fermi Institute] 11:26:26
You know you tend to get put into a queue.

[Enrico Fermi Institute] 11:26:29
You're waiting for another. And then suddenly, you have on use.

[Enrico Fermi Institute] 11:26:37
It's simpler if you stream: you remove the data management part from the equation, because you assume you can just pull data when you need it.

[Enrico Fermi Institute] 11:26:48
But you can't do that if you're being scheduled: suddenly you're getting the 50,000 cores you've been waiting 2 weeks for. On Monday morning they give you 50,000 cores,

[Enrico Fermi Institute] 11:26:57
and you've got no data there, right? Well, if you assume that these 50,000 cores can access the data via streaming, then you can feed them.

[Steven Timm] 11:27:04
Great

[Enrico Fermi Institute] 11:27:06
Yeah, you feed them from somewhere else, and you don't need to schedule the data.

[Enrico Fermi Institute] 11:27:09
So data delivery is on demand. Eventually, of course, you hit scaling limits.

[Steven Timm] 11:27:09
Right.

[Steven Timm] 11:27:12
Good.

[Enrico Fermi Institute] 11:27:18
But that's more a question of where the network comes in, and how our own sites are dimensioned.

[Enrico Fermi Institute] 11:27:23
This is still the introduction; we have the HPC focus area

[Enrico Fermi Institute] 11:27:27
and a couple more sections, so I don't want to go too deep into it.

[Steven Timm] 11:27:28
Okay.

[Enrico Fermi Institute] 11:27:30
But I think the point is, if you think about it from an architectural point of view:

[Enrico Fermi Institute] 11:27:33
having the data that you need on site for your jobs helps enormously, because the storage is presumably sized for the site.

[Steven Timm] 11:27:37
Alright.

[Enrico Fermi Institute] 11:27:43
You hope that the site is sized appropriately for its cores, which may or may not be true in all cases; and there's also the question of reliability.

[Steven Timm] 11:27:48
Great

[Steven Timm] 11:27:52
Okay.

[Enrico Fermi Institute] 11:27:53
The last thing you want is to wait 2 weeks, get your 50,000 cores, and find out

[Enrico Fermi Institute] 11:27:58
today was the day there was an outage. So we got a couple of questions.

[Ian Fisk] 11:27:59
But

[Enrico Fermi Institute] 11:28:04
Shigeki again?

[Steven Timm] 11:28:05
Yeah. So you have to consider not only the size of the file system,

[Shigeki] 11:28:08
Hello!

[Steven Timm] 11:28:12
Sorry, but also the reliability of the file system, and also the IOPS of reading the file system, because we managed to scramble the Lustre file system pretty badly several times.

[Steven Timm] 11:28:24
I'm not sure it's Lustre's fault,

[Steven Timm] 11:28:26
but anyway, I mean, we scrambled their scratch very badly

[Steven Timm] 11:28:29
several times, and Perlmutter is having issues too. It's not our fault.

[Steven Timm] 11:28:33
But these scratch file systems are not always meant to take CMS-level

[Steven Timm] 11:28:38
I/O. Yep, we have to be prepared for this.

[Enrico Fermi Institute] 11:28:43
Oh, the I/O, especially if you look at generator-type workflows, is not great;

[Steven Timm] 11:28:43
Something won't be

[Enrico Fermi Institute] 11:28:49
they're basically built for desktops, and we scale them up to grid scale.

[Enrico Fermi Institute] 11:28:54
I think we have a question coming in.

[Shigeki] 11:28:56
Yeah, I guess my fundamental question is: all of these issues are sort of best addressed at the design phase of the HPC center.

[Shigeki] 11:29:06
And I'm kind of wondering: does the community have an official avenue in which to present our issues and work with them at the design phase of the HPC center, where we can both agree on the mechanism for moving the data in and out?

[Enrico Fermi Institute] 11:29:27
Not really, not at the moment. I think the user facilities are at least aware of what we're doing,

[Enrico Fermi Institute] 11:29:34
the type of work we're doing, because they see it more often.

[Enrico Fermi Institute] 11:29:37
The LCFs, I don't think so, not at this level, because they are really

[Enrico Fermi Institute] 11:29:45
targeting the big things: give me a thousand nodes for my lattice QCD

[Enrico Fermi Institute] 11:29:50
calculation, or protein folding, or whatever they're doing.

[Enrico Fermi Institute] 11:29:52
That's their target market, basically.

[Shigeki] 11:29:55
But I mean, probably that's because that's the target market that they see,

[Shigeki] 11:30:00
and it's sort of a chicken-and-egg problem:

[Shigeki] 11:30:01
they're not going to see the high-throughput issues, because it's so hard to do it there, and they're not going to do anything about it because they just don't see it. It's really a chicken and egg.

[Enrico Fermi Institute] 11:30:11
But that then follows from their Congressional mandate.

[Enrico Fermi Institute] 11:30:13
Why would they go against the Congressional mandate? I think this is also a discussion

[Enrico Fermi Institute] 11:30:18
that's too high-level for us to have any input on.

[Enrico Fermi Institute] 11:30:25
I know they have discussions going on at the very high level about supporting these types of science better.

[Enrico Fermi Institute] 11:30:33
But until there's actually, as Brian said, a mandate for them that says they're supposed to support us better,

[Enrico Fermi Institute] 11:30:41
I don't think they're going to move a lot in terms of making their facilities work better beyond the computation they're doing. What I mean

[Enrico Fermi Institute] 11:30:53
is that APS works with ALCF on taking data from their light source and streaming it.

[Enrico Fermi Institute] 11:31:04
I believe NERSC is in conversations with a couple of the West Coast light sources, and I remember one talk

[Enrico Fermi Institute] 11:31:13
where I think OLCF was talking about doing that also, from the neutron source and some of the accelerators on campus.

[Taylor Childers] 11:31:21
Can I? Right, yeah. Sorry, so I was just...

[Enrico Fermi Institute] 11:31:22
So we have a comment from Taylor. Correct, yeah.

[Taylor Childers] 11:31:28
...going to. And I mean, Doug brought up another good point. But just to comment on a few of the things:

[Taylor Childers] 11:31:36
I'll go to APS first. Our new Polaris machine actually has 60-some-odd nodes dedicated, which we purchased in addition, for the APS for real-time processing. The idea is that workflows there have live detectors that are

[Taylor Childers] 11:31:59
taking data, and we want to see if we can get those scientists on our machines. When it comes to the design process for the new machines,

[Taylor Childers] 11:32:10
right, for instance with Aurora, we had the Aurora Early Science Program;

[Taylor Childers] 11:32:16
OLCF had a similar program, same for Perlmutter.

[Taylor Childers] 11:32:20
Those are entirely designed for how communities get on, you know,

[Taylor Childers] 11:32:27
get early access to our machines. ATLAS submitted one of those projects, and has had myself and, in fact, a postdoc funded through ALCF to help, mostly on event generators,

[Enrico Fermi Institute] 11:32:28
Yeah.

[Taylor Childers] 11:32:46
at this point using Aurora moving forward. So there is a program for being involved in the early process of design for the machine.

[Taylor Childers] 11:33:02
So, for instance, in the ATLAS case, MadGraph is constantly reported on in the Intel meetings for Aurora

[Taylor Childers] 11:33:11
as far as performance and capability, because, you know, we're one of the Early Science projects.

[Taylor Childers] 11:33:23
But the other, I would say the other end of the spectrum,

[Taylor Childers] 11:33:27
is, of course, if you're a big user, right?

[Taylor Childers] 11:33:30
And I think HEP has always had the potential to be a big user at the LCFs.

[Enrico Fermi Institute] 11:33:31
Okay.

[Taylor Childers] 11:33:39
Granted, there are hurdles, especially now with architectures. But if you're a big user, you have big sway, right?

[Taylor Childers] 11:33:49
I mean, the lattice QCD groups can use our entire machines.

[Taylor Childers] 11:33:53
They use them effectively, and of course we pander to them, I would say unofficially, I guess. They get huge sway at our meetings because they are able to use our resources effectively. And same for, I mean, everybody knows the HACC group, Salman's group,

[Taylor Childers] 11:34:12
and the climate scientists, the materials scientists: the software where there's a community base, where it's easy to port to the next-generation

[Taylor Childers] 11:34:23
hardware. They move quickly, the communities move quickly, and they all use similar software.

[Taylor Childers] 11:34:28
They get a lot of pull in those discussions. Now, the last thing I wanted to mention: the difference between NERSC and the LCFs, I would say, is that the LCFs

[Taylor Childers] 11:34:42
get less... they have less

[Taylor Childers] 11:34:48
funding for deploying a lot of user-centric hardware.

[Taylor Childers] 11:34:54
So we've been talking at ALCF, I don't know how long, about trying to set up, you know,

[Taylor Childers] 11:35:01
a side cluster for Kubernetes and stuff like that, where you guys could run all of these services. And as far as I can tell, our operations team is just swamped with stuff to do, and so that becomes a limiting factor for us.

[Enrico Fermi Institute] 11:35:21
Thanks, Taylor. I think that was kind of the direction of my comment:

[Enrico Fermi Institute] 11:35:26
we have to make sure... The LCFs build machines to be HPC machines; you want to make yourself look like the QCD folks and do HPC

[Enrico Fermi Institute] 11:35:39
work. It becomes a huge ask for them to try to do HTC-type workflows, because of the exact sort of pressures you just outlined.

[Taylor Childers] 11:35:51
Yeah.

[Enrico Fermi Institute] 11:35:52
So we have a couple more questions on Zoom. Let's take these questions and then move on to the cloud section. Paolo?

[Paolo Calafiura (he)] 11:36:02
Hi guys. It's actually a comment following up on this.

[Paolo Calafiura (he)] 11:36:07
I find it useful sometimes to put myself in the shoes of the other partner when we have a discussion. Think of it from the point of view of an LCF: today, basically, HEP

[Paolo Calafiura (he)] 11:36:25
is using HPCs at arm's length, let's be honest. I mean, we have some nice Tier-2-like facilities;

[Paolo Calafiura (he)] 11:36:31
at NERSC we are pretty happy with the way things are working. But, you know, QCD,

[Paolo Calafiura (he)] 11:36:42
which we're talking about:

[Paolo Calafiura (he)] 11:36:44
if the LCFs did not exist today, they would not be able to do their science.

[Paolo Calafiura (he)] 11:36:46
And that is something that any LCF will consider: am I fundamental, or am I just one of the 25

[Paolo Calafiura (he)] 11:36:55
or 32 in the federation?

[Paolo Calafiura (he)] 11:37:00
So, I think, at least for the next generation of HPCs, not Aurora but the one after Aurora, the ones which will start in the twenty-thirties or so, maybe we have a shot. But we will need to make a

[Paolo Calafiura (he)] 11:37:21
commitment today, which I don't know if we are ready to make, which is to say that, at least in the US,

[Paolo Calafiura (he)] 11:37:29
the HPCs would become a fundamental part, and not just a beyond-the-pledge accessory to our Tier-1s and Tier-2s. Yeah, and that's also because of the enormous amount of effort we would have to put in, as has been said a couple

[Paolo Calafiura (he)] 11:37:47
of times, to be able to exploit these architectures.

[Paolo Calafiura (he)] 11:37:51
So I think either we jump, or we stay with our friendly facilities and leave it at that.

[Enrico Fermi Institute] 11:37:58
Okay.

[Enrico Fermi Institute] 11:38:06
Ian, comments?

[Ian Fisk] 11:38:07
Yeah, my comment was sort of along the lines of what I also responded to Shigeki: I think one of the things we need to be a little bit careful of is what our expectations are. The biggest one is that these facilities were not built for us, and we know

[Ian Fisk] 11:38:23
that. But that doesn't mean that they can't be useful to us.

[Ian Fisk] 11:38:27
At the same time, we can't expect to use all of them. Frontier is 10 times the size of the WLCG

[Ian Fisk] 11:38:36
combined in terms of flops, so we wouldn't even want to use the whole thing.

[Ian Fisk] 11:38:42
But from the standpoint of the stability of the file systems, like Steve was saying, the scale of the file system:

[Ian Fisk] 11:38:49
I think all these things are things that we can actually measure and benchmark, and look at how much of an LCF

[Ian Fisk] 11:38:55
we might reasonably be able to take advantage of with a workflow it was not designed for,

[Ian Fisk] 11:39:00
instead of having an expectation that they will somehow be different, that they will design these facilities for us.

[Ian Fisk] 11:39:04
They won't; they're built already. And the question is: is a Ferrari still useful to us at some scale? The only real way to answer that is to measure it, to have a benchmark

[Ian Fisk] 11:39:16
which we can use, that says: this is how many resources you can expect to take advantage of before you exceed the local file system, or the local network, or the local whatever else. And it seems like this is a tractable problem, and these resources exist.
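
The benchmark idea could reduce to arithmetic like the following sketch: given measured facility ceilings, estimate how many cores a data-intensive workflow can keep busy before it exceeds the shared file system or the external network. The parameter values are placeholders, not measurements.

```python
# Sketch of the "how much of an LCF can we use" benchmark: the usable
# core count is capped by whichever ceiling (file system or WAN) the
# workflow's per-core I/O hits first. All numbers are placeholders.

def usable_cores(io_mbps_per_core: float, fs_gbps: float,
                 wan_fraction: float, wan_gbps: float) -> int:
    """Cores supportable before hitting file-system or WAN ceilings.

    wan_fraction is the share of each core's I/O that crosses the WAN
    (assumed > 0 here).
    """
    fs_cap = fs_gbps * 1000.0 / io_mbps_per_core
    wan_cap = wan_gbps * 1000.0 / (io_mbps_per_core * wan_fraction)
    return int(min(fs_cap, wan_cap))
```

With, say, 2 Mb/s of I/O per core, a 500 Gb/s file system, and a 100 Gb/s external link that carries all of the I/O, the WAN is the binding limit at 50,000 cores, which is the kind of ceiling the earlier Frontera discussion described.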

[Ian Fisk] 11:39:33
Over the course of time, if we demonstrate that we use them at all, maybe we'll have an influence on the next generation, to make them more useful for us too. But I don't think we're going to be in a situation where all of our stuff looks like AI,

[Ian Fisk] 11:39:49
so that it's a simple transition over to HPC.

[Ian Fisk] 11:39:53
We're not going to... our stuff looks like our stuff.

[Ian Fisk] 11:39:56
It's not going to look like lattice, it's not going to look like AI, necessarily, completely.

[Ian Fisk] 11:40:00
But I think, if we say we know what our workflows look like...

[Ian Fisk] 11:40:04