Workshop for the USATLAS-USCMS HPC/Cloud Blueprint

US/Central
University of Chicago, Room 201, Michelson Center for Physics, 933 E. 56th Street, Chicago, IL 60615
    • 10:00–12:20
      First Day Morning

       

      Monday Morning session
      -----------------------
      (Eastern Time)


      [Enrico Fermi Institute] 11:09:56
      Everybody can hear just fine? Yeah, because I'm sitting here I'm just getting picked up by the mic on the ceiling. Okay?

      [David Mason] 11:09:59
      We can hear.

      [Enrico Fermi Institute] 11:10:01
      Great, thank you. Okay, so if you go to the next slide. The first area that we want to cover is looking a little bit at what we're doing in terms of workflows on HPC and cloud, and to do that, maybe at the very first we look at what resources we are actually looking at

      [Enrico Fermi Institute] 11:10:19
      here right now. So if you look at what's available to us in HPC, we have broadly two types of facilities, and they have different user experiences in terms of how you approach them and how you can use them. There are the leadership class facilities funded by DOE: Argonne,

      [Enrico Fermi Institute] 11:10:38
      Oak Ridge, and so on, and they are kind of...

      [Enrico Fermi Institute] 11:10:41
      They're very restricted. They focus on accelerators to get the most FLOPS for a given power budget.

      [Enrico Fermi Institute] 11:10:47
      They don't care too much about making it easy for the user.

      [Enrico Fermi Institute] 11:10:50
      You are expected to adjust your workflow to be able to run there, and they target large-scale workflows.

      [Enrico Fermi Institute] 11:10:57
      This is the kind of stuff that you can do nowhere else.

      [Enrico Fermi Institute] 11:10:59
      And then you go to the user facilities: NERSC, TACC, the XSEDE (now ACCESS) sites, which are usually a mix.

      [Enrico Fermi Institute] 11:11:11
      Some of them are straightforward; they look like HPC compute clusters in how they're built.

      [Enrico Fermi Institute] 11:11:17
      Some of them have interconnects. There might be a mix of GPUs and CPUs, mostly still CPUs, and they take all comers.

      [Enrico Fermi Institute] 11:11:25
      Basically you can get an allocation. You can get going.

      [Enrico Fermi Institute] 11:11:28
      They work with you to try to make it easy, so you can get on the facility and get your work done. Next slide. And at any time, if you want to make a comment or ask a question, please just ask; we're not supposed to go

      [Enrico Fermi Institute] 11:11:42
      through the whole presentation in one go. It's a discussion. Yep.

      [Enrico Fermi Institute] 11:11:48
      So, with that in mind, what are we currently running there?

      [Enrico Fermi Institute] 11:11:53
      So this is the right now. If you see green, that's a straightforward copy from the charge; there's a question we were asked, so we answer it here. What are we doing right now? So for CMS,

      [Enrico Fermi Institute] 11:12:05
      what we're doing is: basically anything that starts with a generator step and has no input except for pileup,

      [Enrico Fermi Institute] 11:12:12
      we currently assign to all of the US HPC sites. You don't have to do anything special; the workflow gets injected there automatically.

      [Enrico Fermi Institute] 11:12:19
      It can run there, and that was the majority of Run 2 Monte Carlo

      [Enrico Fermi Institute] 11:12:24
      workflows, and the Run 3 Monte Carlo workflows are in a comparable situation. And for ATLAS it's primarily simulation.

      [Enrico Fermi Institute] 11:12:33
      Usually these are specifically assigned to the HPC sites. So you select a bunch; I guess you pick:

      [Enrico Fermi Institute] 11:12:40
      this is a good fit, and then you assign it there, and it runs.

      [Enrico Fermi Institute] 11:12:42
      And they also have the goal to expand on that. The limiting factors

      [Enrico Fermi Institute] 11:12:53
      in what workflows you target at HPC are usually based on machine characteristics. So, CPU architecture: on certain HPCs, I mean, Intel is easy to use; when it gets beyond that, it's currently still a little bit difficult. Do you have a GPU accelerator? How much memory

      [Enrico Fermi Institute] 11:13:10
      do you have per core? Memory per core, with KNL kind of a dying breed, is disappearing a bit as a concern.

      [Enrico Fermi Institute] 11:13:16
      So it's usually okay now. Then, network connectivity. And it's not just from the node; it's also for the facility as a whole.

      [Enrico Fermi Institute] 11:13:27
      Sometimes HPC facilities have restrictions or firewall limits where, when you scale up, you hit scaling limits; you basically overload the pipe, because they don't...

      [Enrico Fermi Institute] 11:13:38
      They're not used to such data-intensive workflows.

      [Enrico Fermi Institute] 11:13:41
      So, a quick question: back when we were talking about CPU architecture and floating-point operations.

      [Enrico Fermi Institute] 11:13:47
      Yeah, what in particular is making that hard, from your perspective? Is it basically, say, ARM or something? Going to ARM is not harder.

      [Enrico Fermi Institute] 11:13:57
      It's just a matter of extra work to validate the platform.

      [Enrico Fermi Institute] 11:14:00
      Okay, so it's really about numerical outcomes and making sure that things agree. Yeah, it's basically a one-time investment in being able to support the platform. Is that true on all of them, though? Because that's not true for the... Yeah, OLCF was, is, a bit of

      [Enrico Fermi Institute] 11:14:18
      an exception; they also have POWER, you know. CMS just finished the POWER validation.

      [Enrico Fermi Institute] 11:14:23
      Okay. So the effective requirement, then, is that for a given CPU architecture,

      [Enrico Fermi Institute] 11:14:32
      the upstream code has to be validated. Well, first you have to build your code;

      [Enrico Fermi Institute] 11:14:38
      it's got to be buildable. And then you need to run whatever physics validation: you produce some samples,

      [Enrico Fermi Institute] 11:14:43
      and then the physicists, the physics group, whatever body in the global collaboration, need to go in and say: this is actually okay.

      [Enrico Fermi Institute] 11:14:50
      So there's a dependency on something external to you.

      [Enrico Fermi Institute] 11:15:02
      It requires labor from outside of the US, because the US

      [Enrico Fermi Institute] 11:15:07
      can't just say this platform is validated; the experiment as a whole has to say that. So, coming back to the why: you couldn't do pileup during digitization because you had to read it remotely? You can do it.

      [Enrico Fermi Institute] 11:15:24
      And that's supported. Basically,

      [Enrico Fermi Institute] 11:15:28
      we currently don't run anything that needs primary input, but pileup is supported, because pileup is so unevenly distributed, because of its size, that for normal production even some Tier-2 sites read it remotely. So that's a use case that's supported anyway, and

      [Enrico Fermi Institute] 11:15:48
      the HPCs just expanded on it. So it's not a limitation?

      [Enrico Fermi Institute] 11:15:53
      No; well, eventually, as you scale up, the network connectivity comes in.

      [Enrico Fermi Institute] 11:15:59
      We have to look at that. For instance, at Frontera we're hitting scaling limits because of remote pileup reads.

      [Enrico Fermi Institute] 11:16:06
      I thought at Frontera there was a limit on the amount of, yeah, the amount of remote access you could do.

      [Enrico Fermi Institute] 11:16:13
      Yes. So we actually hit the external connectivity limit of the facility.

      [Enrico Fermi Institute] 11:16:19
      And as I recall, at Frontera they mostly consider their Ethernet to be like a control plane.

      [Enrico Fermi Institute] 11:16:26
      Each node in the rack is connected at one gig, and each rack is connected

      [Enrico Fermi Institute] 11:16:31
      at 10 gig, I think, something like that, to the core.

      [Enrico Fermi Institute] 11:16:36
      So in that case you probably weren't doing a lot of pileup at Frontera?

      [Enrico Fermi Institute] 11:16:38
      We were reading pileup. Okay, so you weren't hitting, I mean, but you were running...

      [Enrico Fermi Institute] 11:16:42
      You were accessing your pileup datasets over Ethernet, though.

      [Enrico Fermi Institute] 11:16:48
      Yeah. So you were still hitting the overall capacity of TACC,

      [Enrico Fermi Institute] 11:16:54
      then, yeah, like 100 gig or something. Well, in the beginning we actually hit the scaling limitations trying to ramp up. Okay.

      [Enrico Fermi Institute] 11:17:03
      And then they limited us. But it's fine; I mean, the limit is not restricting.

      [Enrico Fermi Institute] 11:17:10
      The limit is still high enough that we don't have a problem using up the allocation over the year.

      [Enrico Fermi Institute] 11:17:14
      We just couldn't do what we tried to do, which is these 100K-core bursts,

      [Enrico Fermi Institute] 11:17:20
      Because at that point the traffic was too high. Yeah.
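
      [Note: an illustrative aside, not something said in the room. The remote pileup reads discussed above are typically done over an XRootD federation; a minimal sketch of such a streamed read with the XRootD Python bindings follows. The redirector URL and file path are placeholders.]

          # Sketch: stream a remote file over XRootD instead of staging it.
          # Requires the XRootD client bindings (pip install xrootd).
          from XRootD import client

          # Placeholder redirector and path -- not a real dataset.
          url = "root://xrootd-redirector.example.org//store/pileup/file.root"

          f = client.File()
          status, _ = f.open(url)
          if not status.ok:
              raise RuntimeError(status.message)

          # Read 1 MB; a real job reads event ranges on demand, and it is
          # this on-demand WAN traffic that hits facility-wide limits.
          status, data = f.read(offset=0, size=1024 * 1024)
          print(len(data), "bytes streamed")
          f.close()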

      [Enrico Fermi Institute] 11:17:27
      Oh yeah, I was at network connectivity. So we discussed the potential facility limits.

      [Enrico Fermi Institute] 11:17:35
      Then, another limitation can be storage. If you use shared storage for input and output data, you have to integrate it into the data management solution, because you basically have to pre-place the data you want to process, and then stage the output data back out later, decoupled

      [Enrico Fermi Institute] 11:17:51
      from the job execution, to your own storage. But another

      [Enrico Fermi Institute] 11:18:01
      consideration is whether job scratch is local or shared.

      [Enrico Fermi Institute] 11:18:04
      For instance, the LCFs usually have only shared storage.

      [Enrico Fermi Institute] 11:18:08
      They don't give you any local storage. Most of the NSF-funded sites, XSEDE and Frontera,

      [Enrico Fermi Institute] 11:18:16
      give you local scratch, and that is another area where you can run into scaling limitations.

      [Enrico Fermi Institute] 11:18:22
      And looking a bit ahead: so this is what we're doing now.

      [Enrico Fermi Institute] 11:18:26
      If you look ahead to the HL-LHC era, assuming the resource mix shifts and we get more HPC

      [Enrico Fermi Institute] 11:18:36
      resources: can we still afford to restrict the workflows

      [Enrico Fermi Institute] 11:18:42
      we run there? Or is that basically restricting ourselves in terms of what we can do operationally?

      [Enrico Fermi Institute] 11:18:53
      Right now we do what's easiest, and that just came out of starting this up.

      [Enrico Fermi Institute] 11:18:59
      And of course, you start up with what's easy to just get something to run.

      [Enrico Fermi Institute] 11:19:03
      But as you become experienced with it, and as the amount of resources goes up, that might not be enough to keep scaling up and to take advantage of opportunities.

      [Enrico Fermi Institute] 11:19:15
      Now, a question from Shigeki.

      [Shigeki] 11:19:18
      Just out of curiosity: this is sort of the state of trying to get things to work at the HPC

      [Shigeki] 11:19:26
      centers as they exist now. Is there any general motivation on the HPC side to sort of meet us halfway? And do they recognize that maybe this is the future, that they really need to meet the external workflows

      [Enrico Fermi Institute] 11:19:44
      There is, but you have to again distinguish between the user facilities and the LCFs. With the user facilities we've had very good experience, especially with NERSC, working with them.

      [Shigeki] 11:19:46
      halfway, in a common sort of way?

      [Enrico Fermi Institute] 11:20:00
      At NERSC we started in like 2016; CMS had our first allocation there, and we started to target these types of workflows.

      [Enrico Fermi Institute] 11:20:10
      For instance, we tested remote data access, and it was kilobytes per second to each node, while the claimed Cori design goal was gigabit to the node,

      [Enrico Fermi Institute] 11:20:22
      or something like that. So obviously something in the stack didn't work.

      [Enrico Fermi Institute] 11:20:25
      So we worked with them for multiple years, and now we're actually kind of there,

      [Enrico Fermi Institute] 11:20:29
      where we're supposed to be. Everything works great. So they are very interested in working with us.

      [Enrico Fermi Institute] 11:20:36
      The LCFs: I don't think we have that relationship.

      [Steven Timm] 11:20:40
      cool.

      [Enrico Fermi Institute] 11:20:42
      It would be great if we had it, but we don't.

      [Steven Timm] 11:20:46
      So NERSC is also already planning for NERSC-10, which is the machine that comes after Perlmutter. They talk to,

      [Steven Timm] 11:20:54
      oh, what do you call it, high-throughput people, to see: what do we need for the next thing? So they're talking.

      [Steven Timm] 11:20:59
      They're talking to us, they're talking to DUNE, whatever; so those meetings are already happening for the next round.

      [Enrico Fermi Institute] 11:21:04
      Yeah.

      [Steven Timm] 11:21:05
      But the others, as you say, are not happening at the moment.

      [Enrico Fermi Institute] 11:21:08
      Yeah, the feedback we got from NERSC is that they're very interested in supporting data-intensive science, and they took what they learned

      [Enrico Fermi Institute] 11:21:16
      on Cori, running these kinds of workloads; they take that into consideration for designing the next machine. Yeah, and in fact, I think Wahid will, hopefully, say something about data-intensive science. Does data-intensive assume pulling stuff over the WAN? Because that's a different, it's a

      [Enrico Fermi Institute] 11:21:34
      different issue, right? I mean, yeah, it can be streaming things, you mentioned that. Yes; as they scale up, you know, we want to put more workflows on,

      [Enrico Fermi Institute] 11:21:45
      and we have to be cognizant of the intrinsic design limitations of the clusters. I mean, running data-intensive science on a facility means you stream everything in and stream it out, or you need local storage to cache what you process

      [Enrico Fermi Institute] 11:22:01
      later. That's it, those are the two options here, and that's what I mentioned about storage.

      [Enrico Fermi Institute] 11:22:10
      It depends what each facility gives you. If you don't have a lot of attached storage, and you can get only a small storage quota compared to your CPU quota, then you don't have a lot of options in terms of how to make use

      [Enrico Fermi Institute] 11:22:23
      of that CPU quota. If you do get a lot of storage, then you can run it like we run regular production on a grid site:

      [Enrico Fermi Institute] 11:22:34
      we pre-stage within our data management systems, we run, and you stage things back out. That makes things simple.
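
      [Note: an illustrative aside. The pre-stage and stage-out pattern described above can be sketched with the Rucio Python client, assuming a configured Rucio environment and that the HPC storage is registered as an RSE; the scope, dataset, and RSE names here are placeholders.]

          # Sketch: pre-place input data at HPC-attached storage before
          # the jobs start, by creating a Rucio replication rule.
          from rucio.client import Client

          rucio = Client()  # assumes a configured Rucio client environment
          dids = [{"scope": "mc", "name": "SomeDataset_GEN-SIM"}]  # placeholder

          rucio.add_replication_rule(
              dids=dids,
              copies=1,
              rse_expression="HPC_SCRATCH",   # placeholder RSE name
              lifetime=14 * 24 * 3600,        # keep for 14 days (seconds)
          )
          # Stage-out is the same mechanism in reverse: a rule on the
          # produced outputs pointing back at a grid site.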

      [Ian Fisk] 11:22:40
      oh!

      [Enrico Fermi Institute] 11:22:41
      Say, do you have an idea of what the scale there would be to make

      [Enrico Fermi Institute] 11:22:45
      these facilities more usable? I mean, the ballpark figure we usually say is that a CMS site with a sizable amount of CPU would like to have something like 500 TB of space. Roughly, I'd say, simply hundreds of terabytes. Yeah, we could use probably

      [Enrico Fermi Institute] 11:23:02
      300, 400, but around that point. If it's less than 100, it gets difficult.

      [Enrico Fermi Institute] 11:23:06
      Yeah. And that's usually where we are with the experience from ALCC grants, for instance; usually 150 is kind of the cutoff. That's not a lot.

      [Enrico Fermi Institute] 11:23:23
      Of course it would be nice if we could ask for a large storage allocation and just, you know, run your own storage element there,

      [Enrico Fermi Institute] 11:23:29
      treat it like another site. But then that also comes into, you know, expecting storage allocations over long periods of time, rather than a yearly kind of allocation.

      [Enrico Fermi Institute] 11:23:43
      Yeah.

      [Ian Fisk] 11:23:43
      Yeah, I'm I'm wondering if there, if somehow the concept is streaming in or to local storage, is a distinction without a lot of a difference.

      [Ian Fisk] 11:23:52
      It's more about the timescale, right? They have 100 TB of data.

      [Enrico Fermi Institute] 11:23:54
      Yeah.

      [Ian Fisk] 11:23:56
      You're either streaming it directly in real time, or you're staging it in and staging it out, because 100 TB of data is not a ton of space at large scale.

      [Enrico Fermi Institute] 11:24:01
      Yeah, there's a small technical difference, because in one case you just keep the data in job scratch,

      [Enrico Fermi Institute] 11:24:12
      and in the other case you have to place it somewhere that's independent of job execution. And that can make a technical difference because, for instance, I don't think NERSC counts job scratch against your scratch quota,

      [Ian Fisk] 11:24:27
      Okay.

      [Enrico Fermi Institute] 11:24:29
      while if you put something in via the DTNs, the data transfer nodes, that does count against it. And I think a lot of it's also cultural, right, in terms of

      [Enrico Fermi Institute] 11:24:42
      not commonly seeing workflows that stream data. What most people expect, and cater toward, is data coming in through the DTNs to the file system,

      [Enrico Fermi Institute] 11:25:00
      and some time later a process uses it. So the...

      [Ian Fisk] 11:25:04
      But somehow there's a balance here between the networking, the local storage, and the I/O of the jobs: you need to have a sufficient amount of

      [Ian Fisk] 11:25:12
      I/O to keep the resources busy. And so it's not much more complicated than that.

      [Ian Fisk] 11:25:20
      And it has to be a convergent system, in the sense that you're not going to be able to have the storage forever.

      [Enrico Fermi Institute] 11:25:27
      Okay, yeah, you're of course right. It's a storage management problem more than

      [Enrico Fermi Institute] 11:25:35
      it is a storage problem, and you...

      [Ian Fisk] 11:25:39
      I'm claiming it's a data delivery problem, whether it's being streamed in or whether it's being cached from a stream.

      [Ian Fisk] 11:25:45
      Both of them are effectively the same problem, which is: how do I get data in?

      [Ian Fisk] 11:25:52
      And if I look at the timescale: if something is streaming in, it's sort of a real-time problem, and it's a little bit simpler in the sense that it's a network thing; I know the I/O when there's

      [Ian Fisk] 11:26:03
      not a lag. But if I expand it out to the timescale of even just a couple of weeks, staging it in still requires a certain amount of networking, and staging out too.

      [Ian Fisk] 11:26:13
      How much time do I have on this particular resource?

      [Enrico Fermi Institute] 11:26:17
      So, Dirk, doesn't this depend on the scheduling modality of the HPC?

      [Enrico Fermi Institute] 11:26:22
      Because they tend to come...

      [Enrico Fermi Institute] 11:26:26
      You know you tend to get put into a queue.

      [Enrico Fermi Institute] 11:26:29
      You're waiting for a while, and then suddenly you have a huge amount to use.

      [Enrico Fermi Institute] 11:26:37
      It's simpler if you stream: you remove the data management part from the equation, because you assume you can just pull the data when you need it.

      [Enrico Fermi Institute] 11:26:48
      But you can't do that if you're being scheduled, where you're suddenly getting 50,000 cores. You've been waiting for two weeks, then on Monday morning they give you 50,000 cores,

      [Enrico Fermi Institute] 11:26:57
      and you've got no data there, right? Well, if you assume that these 50,000 cores can access the data via streaming, then you can feed them.

      [Steven Timm] 11:27:04
      Great

      [Enrico Fermi Institute] 11:27:06
      Yeah, you pull it from somewhere else, and you don't need to schedule the data.

      [Enrico Fermi Institute] 11:27:09
      So data delivery is on demand. Eventually, of course, you hit scaling limits.

      [Steven Timm] 11:27:09
      Right.

      [Steven Timm] 11:27:12
      Good.

      [Enrico Fermi Institute] 11:27:18
      But that's more a question of where the network comes in, and how our own sites are dimensioned.

      [Enrico Fermi Institute] 11:27:23
      This is still the introduction; we have the HPC focus area,

      [Enrico Fermi Institute] 11:27:27
      and we also have a couple of sessions on it, so I don't want to go too deep into it.

      [Steven Timm] 11:27:28
      Okay.

      [Enrico Fermi Institute] 11:27:30
      But I think the point is, if you think about it from an architectural point of view,

      [Enrico Fermi Institute] 11:27:33
      having the data that you need on site for your run could help enormously, because it's presumably sized for you.

      [Steven Timm] 11:27:37
      Alright.

      [Enrico Fermi Institute] 11:27:43
      You hope that the site is sized appropriately for the cores, which may or may not be true in all cases, and there's also the question of reliability.

      [Steven Timm] 11:27:48
      Great

      [Steven Timm] 11:27:52
      Okay.

      [Enrico Fermi Institute] 11:27:53
      The last thing you want is to wait two weeks, get your 50,000 cores, and then find that

      [Enrico Fermi Institute] 11:27:58
      today was the day there was an outage. So we've got a couple of questions.

      [Ian Fisk] 11:27:59
      But

      [Enrico Fermi Institute] 11:28:04
      Shigeki again?

      [Steven Timm] 11:28:05
      Yeah. So you have to consider not only the size of the file system,

      [Shigeki] 11:28:08
      Hello!

      [Steven Timm] 11:28:12
      sorry, but also the reliability of the file system, and also the IOPS of reading the file system, because we managed to scramble the Lustre file system pretty badly several times.

      [Steven Timm] 11:28:24
      I'm not sure it's Lustre there,

      [Steven Timm] 11:28:26
      but anyway, we scrambled their scratch very badly

      [Steven Timm] 11:28:29
      a couple of times. And Perlmutter is having issues too; it's not our fault.

      [Steven Timm] 11:28:33
      But their scratch file systems are not always meant to take CMS-level

      [Steven Timm] 11:28:38
      I/O. Yep, we have to be prepared for this.

      [Enrico Fermi Institute] 11:28:43
      Our I/O, especially if you look at generator-type workflows, is not great.

      [Steven Timm] 11:28:43
      Something won't be

      [Enrico Fermi Institute] 11:28:49
      The generators are basically built for desktops, and we scale that up to grid load.

      [Enrico Fermi Institute] 11:28:54
      And we have Shigeki coming in.

      [Shigeki] 11:28:56
      Yeah, I guess my fundamental question is: all of these issues are sort of best addressed at the design phase of the HPC center.

      [Shigeki] 11:29:06
      And I'm kind of wondering: does the community have an official avenue in which to present our issues and work with them in the design space of the HPC center, where we can both agree on the mechanism for moving the data in and out?

      [Enrico Fermi Institute] 11:29:27
      Not really, not at the moment. I think the user facilities are at least aware of what we're doing,

      [Enrico Fermi Institute] 11:29:34
      the type of work we're doing, because they see this more often.

      [Enrico Fermi Institute] 11:29:37
      The LCFs, I don't think so, not at this level, because they are really...

      [Enrico Fermi Institute] 11:29:45
      They're targeting these things: give me 1,000 nodes for my lattice QCD

      [Enrico Fermi Institute] 11:29:50
      calculation, or protein folding, or whatever they're doing.

      [Enrico Fermi Institute] 11:29:52
      That's their target market, basically.

      [Shigeki] 11:29:55
      But I mean, probably that's because that's the target market that they see.

      [Shigeki] 11:30:00
      And it's sort of a chicken and egg problem.

      [Shigeki] 11:30:01
      They're not going to see the high-throughput issues because it's so hard to do, and they're not going to do anything about it because they just don't see it. It's really a chicken-and-egg problem.

      [Enrico Fermi Institute] 11:30:11
      But that follows from their Congressional mandate.

      [Enrico Fermi Institute] 11:30:13
      So why would they go against the Congressional mandate? I think this is also a discussion

      [Enrico Fermi Institute] 11:30:18
      that's too high-level for us to have any input on.

      [Enrico Fermi Institute] 11:30:25
      I know they have discussions going on at a very high level about supporting these types of science better.

      [Enrico Fermi Institute] 11:30:33
      But until there's actually, as Brian said, until there's actually a mandate for them that says they're supposed to support us better,

      [Enrico Fermi Institute] 11:30:41
      I don't think they're going to move a lot in terms of making their facilities work better for the computation that we're doing. So, one thing I'll mention

      [Enrico Fermi Institute] 11:30:53
      is that the APS works with ALCF on taking data from their light source and streaming it.

      [Enrico Fermi Institute] 11:31:04
      I believe NERSC is in conversations with a couple of the West Coast light sources. And I remember one talk I was at,

      [Enrico Fermi Institute] 11:31:13
      I think OLCF was talking about doing that also, for the neutron source and some of the accelerators on campus.

      [Taylor Childers] 11:31:21
      Can I? Right, yeah. Sorry, so I was just

      [Enrico Fermi Institute] 11:31:22
      So we have a comment from Taylor. Correct, yeah.

      [Taylor Childers] 11:31:28
      going to. And I mean, Doug brought up another good point. But, just to comment on a few of the things:

      [Taylor Childers] 11:31:36
      I'll go to the APS first. Our new Polaris machine actually has 60-some-odd nodes dedicated, which we purchased in addition, for the APS, for real-time processing. The idea is that, you know, workflows there have live detectors that are

      [Taylor Childers] 11:31:59
      taking data, and we want to see if we can get those scientists on our machines. When it comes to the design process for the new machines:

      [Taylor Childers] 11:32:10
      Right, for instance, with Aurora we had the Aurora Early Science program.

      [Taylor Childers] 11:32:16
      OLCF had a similar program, same for Perlmutter.

      [Taylor Childers] 11:32:20
      Those are entirely designed for how communities get on, you know,

      [Taylor Childers] 11:32:27
      get early access to our machines. ATLAS submitted one of those projects, and has had myself and, in fact, a postdoc funded through ALCF to help, mostly on event generators,

      [Enrico Fermi Institute] 11:32:28
      Yeah.

      [Taylor Childers] 11:32:46
      to be used on Aurora moving forward. So there is a program for being involved in the early design process for the machine.

      [Taylor Childers] 11:33:02
      So, for instance, in the ATLAS case, MadGraph is constantly reported on in the Intel meetings for Aurora,

      [Taylor Childers] 11:33:11
      as far as performance and capability, because, you know, we're one of the Early Science projects.

      [Taylor Childers] 11:33:23
      But the other, I would say the other end of the spectrum,

      [Taylor Childers] 11:33:27
      is, of course, if you're a big user, right?

      [Taylor Childers] 11:33:30
      And I think HEP has always had the potential to be a big user at the LCFs.

      [Enrico Fermi Institute] 11:33:31
      Okay.

      [Taylor Childers] 11:33:39
      Granted, there are hurdles, especially now with the architectures, but if you're a big user, you have big sway, right?

      [Taylor Childers] 11:33:49
      I mean, the lattice QCD groups: they can use our entire machines.

      [Taylor Childers] 11:33:53
      They use them effectively, and of course we pander to them, I would say unofficially, I guess; but I mean, they get huge sway at our meetings because they are able to effectively use our resources. And same for, I mean, everybody knows the HACC group, Salman's group,

      [Taylor Childers] 11:34:12
      and the climate scientists, right, the materials scientists: the groups whose software and community base make it easy to port to the next generation of

      [Taylor Childers] 11:34:23
      hardware. They move quickly, the communities move quickly, and they all use similar software.

      [Taylor Childers] 11:34:28
      They get a lot of pull in those discussions. Now, the last thing I wanted to mention, the difference between NERSC and the LCFs, I would say, is that the LCFs

      [Taylor Childers] 11:34:42
      get less, they have less

      [Taylor Childers] 11:34:48
      funding for deploying a lot of user-centric hardware.

      [Taylor Childers] 11:34:54
      So we've been talking at ALCF, I don't know for how long, about trying to set up, you know,

      [Taylor Childers] 11:35:01
      a side cluster for Kubernetes and stuff like that, where you guys could run all of these services. And as far as I can tell, our ops team, our operations team, is just swamped with stuff to do, and so that becomes a limiting factor for us.

      [Enrico Fermi Institute] 11:35:21
      Thanks, Taylor. I think that was kind of the direction of my comment.

      [Enrico Fermi Institute] 11:35:26
      We have to be aware, you know: the LCFs build machines to be HPC machines, and you want to make yourself look like the QCD folks and do HPC

      [Enrico Fermi Institute] 11:35:39
      work. It becomes a huge ask for them to try to support HTC-type workflows, because of the exact sort of pressures you just outlined.

      [Taylor Childers] 11:35:51
      Yeah.

      [Enrico Fermi Institute] 11:35:52
      So we have a couple more questions on Zoom. Let's take those questions and then move on to the cloud section. Paolo?

      [Paolo Calafiura (he)] 11:36:02
      Hi guys. So it's actually a comment following up on this.

      [Paolo Calafiura (he)] 11:36:07
      And I find it useful sometimes to put myself in the shoes of the other partner when we have any discussion. I mean, think of it from the point of view of an LCF. Today, basically, HEP

      [Paolo Calafiura (he)] 11:36:25
      is using HPCs at arm's length, let's be honest. I mean, we have some nice Tier-2-like facilities,

      [Paolo Calafiura (he)] 11:36:31
      at NERSC; we are pretty happy with the way NERSC is working. But, you know, QCD:

      [Paolo Calafiura (he)] 11:36:42
      we're talking about, if the LCFs

      [Paolo Calafiura (he)] 11:36:44
      did not exist today, they would not be able to do their science.

      [Paolo Calafiura (he)] 11:36:46
      And so that is something that anyone will consider: am I fundamental, or am I just one of the 25 or

      [Paolo Calafiura (he)] 11:36:55
      32 in the federation?

      [Paolo Calafiura (he)] 11:37:00
      So I think, at least for the next generation of HPCs, not Aurora, but the one after Aurora, the ones which will start in the twenty-thirties or so, maybe we have a shot. But we would need to make a

      [Paolo Calafiura (he)] 11:37:21
      commitment today which I don't know if we are ready to make, which is to say that, at least in the US,

      [Paolo Calafiura (he)] 11:37:29
      the HPCs would become a fundamental part, and not just a beyond-the-pledge accessory to our Tier-1s and Tier-2s. That's also because of the enormous amount of effort we would have to put in, as has been said a couple

      [Paolo Calafiura (he)] 11:37:47
      of times, to be able to exploit these architectures.

      [Paolo Calafiura (he)] 11:37:51
      So I think either we jump, or we stay with our friendly NERSC and the people we know how to work with there.

      [Enrico Fermi Institute] 11:37:58
      Okay.

      [Enrico Fermi Institute] 11:38:06
      Ian, comments?

      [Ian Fisk] 11:38:07
      Yeah, my comment was sort of along the lines of, I also responded to Shigeki: I think one of the things we need to be a little bit careful of is what our expectations are, and the biggest one is that these facilities were not built for us, and we know

      [Ian Fisk] 11:38:23
      that but that doesn't mean that they can't be useful to us.

      [Ian Fisk] 11:38:27
      At the same time, we can't expect to use all of them. I mean, Frontier is 10 times the size of the WLCG

      [Ian Fisk] 11:38:36
      combined in terms of FLOPS, and so we wouldn't even want to use the whole thing.

      [Ian Fisk] 11:38:42
      But from the standpoint of, like, the stability of the file systems, as Steve was saying, the scale of the file system:

      [Ian Fisk] 11:38:49
      I think all these things are things that we can actually measure and benchmark, and look at how much of an LCF

      [Ian Fisk] 11:38:55
      we might reasonably be able to take advantage of with a workflow that it was not designed for.

      [Ian Fisk] 11:39:00
      And instead of having an expectation that they will be somehow different, that they will design these facilities for us:

      [Ian Fisk] 11:39:04
      they won't; they're built for others already. And the question is, is a Ferrari still useful to us at some scale? And the only real way to answer that is to measure it, to have a benchmark

      [Ian Fisk] 11:39:16
      which we can use that says: this is how many resources you can expect to take advantage of before you exceed the local file system, or the local network, or the local whatever else. And it seems like this is a tractable problem, and these resources exist.

      [Ian Fisk] 11:39:33
      Over the course of time, if we demonstrate that we use them at all, maybe we'll have an influence on the next generation, to make them more useful for us too. But I think we're not going to be in a situation where basically all of our stuff looks like AI,

      [Ian Fisk] 11:39:49
      so that it's a simple transition over to HPC.

      [Ian Fisk] 11:39:53
      Our stuff looks like our stuff.

      [Ian Fisk] 11:39:56
      It's not going to look like lattice; it's not necessarily going to look completely like AI.

      [Ian Fisk] 11:40:00
      But I think, if we say we know what our workflows look like,

      [Ian Fisk] 11:40:04
      how many of them could we run? There's the possibility that we could get a lot of work done.
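
      [Note: an illustrative aside. The kind of benchmark described above reduces to simple arithmetic; every number below is an assumed, illustrative value, not a measurement of any facility.]

          # Sketch: how many cores can a facility keep busy before the
          # tightest shared resource saturates? All inputs are assumptions.
          def max_usable_cores(per_core_io_MBps, shared_limits_MBps):
              """Cores supportable before the tightest limit is exceeded."""
              return int(min(shared_limits_MBps.values()) / per_core_io_MBps)

          limits = {
              "facility_wan": 12_500,    # ~100 Gb/s border link (assumed)
              "shared_scratch": 50_000,  # file-system bandwidth (assumed)
          }

          # e.g. ~2 MB/s per core for a pileup-heavy step (assumed)
          print(max_usable_cores(2.0, limits))  # -> 6250 cores, WAN-limited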

      [Enrico Fermi Institute] 11:40:11
      Okay, I think that's a good point. One more thing from... oh, there's another one. Yeah, and then we're going to move on.

      [Enrico Fermi Institute] 11:40:17
      Okay.

      [Dale Carder] 11:40:20
      Yeah, I was just going to chime in. There was a question about, you know, how does this compare to what you see at the other light sources, like the APS and ALCF. The closest analogy is probably LCLS-II, which is at SLAC, with the compute

      [Dale Carder] 11:40:36
      some of that being at NERSC. So there's a review underway, much like how

      [Dale Carder] 11:40:42
      ESnet did a requirements review with high energy physics.

      [Dale Carder] 11:40:45
      There's a review underway right now with Basic Energy Sciences,

      [Dale Carder] 11:40:49
      BES. I think that should be published, at least in draft form, in a matter of weeks.

      [Dale Carder] 11:40:54
      So that may be something you could look at within the timeline of this workshop and its deliverables.

      [Enrico Fermi Institute] 11:41:01
      Great, thanks. Yes. Okay, let's move away from HPC

      [Enrico Fermi Institute] 11:41:08
      for the moment and on to cloud. I think Fernando will go through these slides.

      [Fernando Harald Barreiro Megino] 11:41:14
      Hi. Yeah, so now it's a similar discussion, but for cloud: what are the workloads that can be executed on

      [Fernando Harald Barreiro Megino] 11:41:26
      cloud resources? Before getting there: what we have been mostly considering during our previous discussions for the blueprint process are the major commercial cloud providers, like Google, Amazon, and Microsoft, which are the ones that we have been really testing in the last couple of years.

      [Fernando Harald Barreiro Megino] 11:41:45
      All of these have different service levels. They provide infrastructure as a service, where you rent a machine, install things, and do whatever you want; platform as a service, for higher-level services; and software as a service. But nowadays all of these clouds also have emerging intermediate levels, in particular

      [Fernando Harald Barreiro Megino] 11:42:05
      containers as a service: for example, Kubernetes, managed versions of Kubernetes, or serverless container execution.

      [Fernando Harald Barreiro Megino] 11:42:16
      We rely on these services, along with cloud-native approaches, to integrate our experiment

      [Fernando Harald Barreiro Megino] 11:42:24
      frameworks across the cloud providers, so that all of them look the same.

      [Fernando Harald Barreiro Megino] 11:42:30
      Yeah. And then the other cloud provider that has been tested lately is Lancium.

      [Fernando Harald Barreiro Megino] 11:42:47
      They differentiate themselves through, in particular, sustainability and the usage of renewable energy. They are also much more affordable than Google. But they are also not a full-blown cloud; they just have limited services, and

      [Fernando Harald Barreiro Megino] 11:43:09
      reliability probably depends on how much renewable energy there is at the moment.

      [Fernando Harald Barreiro Megino] 11:43:17
      And so CMS is trying to, has integrated them

      [Enrico Fermi Institute] 11:43:18
      Okay.

      [Fernando Harald Barreiro Megino] 11:43:20
      already, for some simple tests. So, next slide.

      [Fernando Harald Barreiro Megino] 11:43:29
      So for ATLAS, then, coming to the question: what are the use cases that are possible to execute on the cloud?

      [Fernando Harald Barreiro Megino] 11:43:39
      Lately we are integrating clouds as completely independent, self-managed sites, with a storage element and also compute, all integrated in PanDA. What we have the most experience

      [Fernando Harald Barreiro Megino] 11:44:01
      with is Google, and we started in the middle of this year to run

      [Fernando Harald Barreiro Megino] 11:44:05
      a cluster similar in size to a US Tier-2, and we are running now at

      [Fernando Harald Barreiro Megino] 11:44:11
      10,000 cores, currently limited to production workloads.

      [Fernando Harald Barreiro Megino] 11:44:17
      But that's just because we are reorganizing the storage behind it, and we plan to enable analysis in a couple of weeks.

      [Fernando Harald Barreiro Megino] 11:44:29
      The one thing that you maybe want to control is the amount of egress, to bring down the cost.

      [Fernando Harald Barreiro Megino] 11:44:39
      If you want to do that, the obvious choice is to run simulation.

      [Fernando Harald Barreiro Megino] 11:44:44
      But we are also now starting to experiment with full chain, where you run all of the tasks within

      [Fernando Harald Barreiro Megino] 11:44:55
      the campaign, simulation and so on through production, and we don't export the intermediate products,

      [Fernando Harald Barreiro Megino] 11:45:01
      just the final outputs. What I wanted to show in the plot is that, depending on the workload you are running, your egress costs can vary by a lot, and that's the motivation for trying to keep things inside. And then the other thing

      [Fernando Harald Barreiro Megino] 11:45:24
      that we have been experimenting with in the cloud is analysis-facility types of setups

      [Fernando Harald Barreiro Megino] 11:45:31
      with elastic scaling. So we set up an analysis facility with Jupyter and Dask. We keep the general components running on the cloud to a minimum, and only scale out a lot of VMs when they are requested by a user to

      [Fernando Harald Barreiro Megino] 11:45:49
      run a Dask computation. And this is also a very suitable setup for the cloud, because you just pay for the resources that you are using at the moment.
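
      [Note: an illustrative aside. A minimal sketch of the elastic Jupyter/Dask setup described above, assuming the Dask Kubernetes operator is deployed on the cloud cluster; the cluster name, image, and sizes are placeholders.]

          # Sketch: an analysis-facility Dask cluster that scales to zero
          # when idle and bursts out only while a user computation runs.
          from dask_kubernetes.operator import KubeCluster
          from dask.distributed import Client

          cluster = KubeCluster(
              name="af-burst",                   # placeholder name
              image="ghcr.io/dask/dask:latest",  # placeholder worker image
          )
          # Keep the always-on footprint minimal; pay only while computing.
          cluster.adapt(minimum=0, maximum=500)

          client = Client(cluster)
          # Workers are created on demand for this call, then released.
          print(client.submit(sum, range(10_000)).result())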

      [Fernando Harald Barreiro Megino] 11:46:03
      then in the next slide

      [Fernando Harald Barreiro Megino] 11:46:07
      So this is the landscape of clouds for CMS.

      [Fernando Harald Barreiro Megino] 11:46:12
      I don't know if Kenyi wants to talk about it, or me to

      [Fernando Harald Barreiro Megino] 11:46:16
      go through it.

      [Kenyi Paolo Hurtado Anampa] 11:46:18
      Yes. So, in essence, back in 2016, the Fermilab HEPCloud demonstration:

      [Kenyi Paolo Hurtado Anampa] 11:46:26
      work was done to try different cloud providers for running production workloads. What was done with the HEPCloud team used generator workloads, but it basically shows that we can run any kind of production workflow in the cloud, and you can see the

      [Enrico Fermi Institute] 11:46:34
      Okay.

      [Kenyi Paolo Hurtado Anampa] 11:46:52
      diagram there on the right. This was when the HEPCloud facility was in its infancy,

      [Kenyi Paolo Hurtado Anampa] 11:47:00
      used to get twice the number of resources that were initially available from the Global Pool.

      [Kenyi Paolo Hurtado Anampa] 11:47:06
      So this is showing like 150,000

      [Kenyi Paolo Hurtado Anampa] 11:47:11
      cores at the peak there, on top of the baseline. The resources were integrated through HEPCloud, which was also integrated with glideinWMS as part of the pool. And as of today we can still use this; there is some work

      [Enrico Fermi Institute] 11:47:34
      Yeah.

      [Kenyi Paolo Hurtado Anampa] 11:47:39
      ongoing to use this for, for example, specialized analysis workloads that depend on machine-learning inference.

      [Kenyi Paolo Hurtado Anampa] 11:47:48
      So there is some work to

      [Enrico Fermi Institute] 11:47:59
      Okay.

      [Kenyi Paolo Hurtado Anampa] 11:48:01
      Utilize what gpus and to use drone different cloud providers.

      [Kenyi Paolo Hurtado Anampa] 11:48:10
      There is an inference server, called Triton,

      [Kenyi Paolo Hurtado Anampa] 11:48:18
      that was also integrated as part of SONIC. And with that

      [Kenyi Paolo Hurtado Anampa] 11:48:25
      you can run the analysis pipeline with the machine-learning inference going through Triton

      [Kenyi Paolo Hurtado Anampa] 11:48:37
      servers at the cloud providers, on GPUs or CPUs.
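
      [Note: an illustrative aside. A minimal sketch of a client-side call to a Triton inference server, the pattern SONIC uses; the server address, model, and tensor names are placeholders.]

          # Sketch: offload one inference call to a remote Triton server.
          import numpy as np
          import tritonclient.grpc as grpcclient

          # Placeholder endpoint of a Triton server in the cloud.
          triton = grpcclient.InferenceServerClient(url="triton.example.org:8001")

          # Hypothetical model with one FP32 input of shape [1, 128].
          inp = grpcclient.InferInput("input__0", [1, 128], "FP32")
          inp.set_data_from_numpy(np.random.rand(1, 128).astype(np.float32))

          result = triton.infer(model_name="example_model", inputs=[inp])
          print(result.as_numpy("output__0").shape)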

      [Enrico Fermi Institute] 11:48:41
      I can put some numbers in. I think they ran on 10,000 CPU cores:

      [Enrico Fermi Institute] 11:48:50
      10,000 CPU cores, and they rented 100 GPUs and sped up the workflow that was running on the CPUs by 10%. So in that game you basically invest a little bit in GPUs to speed up the calculation that runs on

      [Enrico Fermi Institute] 11:49:05
      the CPUs. So the ratio was 10,000 to how many GPUs? 100. I mean, it's early work, so hopefully that ratio can be improved, but that was what they were testing.

      [Enrico Fermi Institute] 11:49:20
      Okay. Questions or comments on the landscape of clouds?

      [Enrico Fermi Institute] 11:49:30
      Anything more to bring up? Otherwise we can move on to acquisition and operation.

      [Fernando Harald Barreiro Megino] 11:49:35
      okay.

      [Ian Fisk] 11:49:36
      Sorry, I have a comment; I thought I was talking, sorry.

      [Ian Fisk] 11:49:42
      This is Ian. So the general comment was: we have this issue about the egress charges, which we don't ever seem to have a solution for, except not to export data.

      [Enrico Fermi Institute] 11:49:43
      Okay.

      [Enrico Fermi Institute] 11:49:43
      Okay, Got it.

      [Steven Timm] 11:49:56
      No, not so; there are agreements.

      [Ian Fisk] 11:50:03
      But the agreements are always things like: if it's within 15% of the billing charges, they'll waive it.

      [Ian Fisk] 11:50:09
      There are ways to reduce it.

      [Ian Fisk] 11:50:11
      But fundamentally this is a business practice that they use for vendor lock-in, and so

      [Ian Fisk] 11:50:19
      far, at least, no one's been proposing to drop it.

      [Ian Fisk] 11:50:21
      And so we're always stuck with it.

      [Enrico Fermi Institute] 11:50:23
      Two things. Lancium does not have egress charges,

      [Ian Fisk] 11:50:26
      Okay.

      [Enrico Fermi Institute] 11:50:27
      with the limitation that we're still exploring them and it's very early going.

      [Steven Timm] 11:50:28
      Pretty good.

      [Enrico Fermi Institute] 11:50:32
      But by design, at least what they're saying now, they don't charge egress.

      [Ian Fisk] 11:50:37
      Right.

      [Enrico Fermi Institute] 11:50:38
      And then, Fernando, do you want to say something about the subscription

      [Enrico Fermi Institute] 11:50:41
      model? What that model is? Because I...

      [Fernando Harald Barreiro Megino] 11:50:43
      I could discuss that tomorrow during the cloud session.

      [Ian Fisk] 11:50:47
      Okay.

      [Fernando Harald Barreiro Megino] 11:50:48
      But, I mean, basically the agreement we have with Google is a subscription agreement,

      [Fernando Harald Barreiro Megino] 11:50:57
      and that's basically like a flat rate.

      [Fernando Harald Barreiro Megino] 11:51:00
      You agree on a price and on the amount of resources that are included,

      [Fernando Harald Barreiro Megino] 11:51:03
      and the egress will not be metered; there is no meter on how much egress you do,

      [Fernando Harald Barreiro Megino] 11:51:08
      or what you do. It is a fixed price for your 15 months of usage.
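
      [Note: an illustrative aside. A back-of-envelope comparison of the flat-rate subscription described above against metered on-demand billing; every price and usage figure below is an assumption for illustration only.]

          # Sketch: flat-rate subscription vs. metered on-demand billing.
          months, cores = 15, 10_000
          core_hour_price = 0.03        # $/core-hour (assumed list price)
          egress_tb_per_month = 50      # exported output (assumed)
          egress_price_per_tb = 90      # ~$0.09/GB (assumed list price)

          metered = months * (cores * 730 * core_hour_price
                              + egress_tb_per_month * egress_price_per_tb)
          subscription = 2_500_000      # fixed total price (assumed)

          print(f"metered ~${metered:,.0f} vs subscription ${subscription:,}")
          # Under the subscription, egress is simply not metered, so usage
          # patterns (e.g. a final bulk export) cannot change the bill.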

      [Ian Fisk] 11:51:14
      Yeah, okay. So I guess the question is: at the end of your 15 months, if you want to use the last month only to export your data and get out of the cloud, that would be within the confines of the model. Is that a true statement?

      [Fernando Harald Barreiro Megino] 11:51:33
      It was within the model. As you are running jobs, the output is always exported anyway, so we are always incurring the egress cost.

      [Ian Fisk] 11:51:40
      Okay, alright. I guess my point is that this is a fundamental problem, which is that we can essentially only use the cloud a lot like HPC, except that with HPC we pre-stage the data.

      [Enrico Fermi Institute] 11:51:42
      Yeah.

      [Steven Timm] 11:51:51
      Yeah.

      [Enrico Fermi Institute] 11:52:01
      Yeah. I mean, my opinion on the cloud is that workflow selection and capabilities are not the issue, because we can do anything we want on the cloud; it's just a machine you rent. The question comes down to: what's the cost?

      [Steven Timm] 11:52:22
      Great Great. Well, this one, you

      [Enrico Fermi Institute] 11:52:23
      And how do they structure the pricing? What do they want you to do and allow you to do, and in what way?

      [Enrico Fermi Institute] 11:52:29
      What are the limitations?

      [Ian Fisk] 11:52:30
      And the other point I wanted to make was that one of the fundamental differences between HPC

      [Ian Fisk] 11:52:37
      and cloud is that HPC, at the leadership class, relies almost exclusively on accelerated, GPU-style

      [Ian Fisk] 11:52:44
      hardware. And it's not that the clouds

      [Ian Fisk] 11:52:48
      don't have them, but they're the most expensive elements on the cloud, and it's because they depreciate so fast that the cloud providers need to recoup that cost in a shorter period of time

      [Ian Fisk] 11:52:59
      than they do for CPU. So you find that the economics of the GPU and the CPU are different on the cloud.

      [Enrico Fermi Institute] 11:53:09
      It's also structural. No, I'll leave that comment, because we do have the cloud focus area tomorrow.

      [Ian Fisk] 11:53:15
      Okay, right

      [Enrico Fermi Institute] 11:53:15
      We should not try to have all the discussions now. Let's have a comment from

      [Enrico Fermi Institute] 11:53:20
      Johannes.

      [Johannes Elmsheuser] 11:53:22
      Yeah, just to follow up on the egress, right?

      [Johannes Elmsheuser] 11:53:26
      So if you go one slide back, to slide 11: Fernando has a little bit of a breakdown there

      [Johannes Elmsheuser] 11:53:34
      of the different costs. And there's always, I think, some fear that egress is really humongous compared to everything else. But from what we are seeing, we were running, for example, on AWS

      [Johannes Elmsheuser] 11:53:47
      and doing physics validation there, and the egress is not the overall cost driver, unless you do really crazy stuff, right?

      [Johannes Elmsheuser] 11:53:57
      So when you have a regular simulation task, egress is not dominant; it's really the CPU

      [Johannes Elmsheuser] 11:54:03
      that you are scaling up, that is driving the cost.

      [Johannes Elmsheuser] 11:54:06
      It is obviously something that you are paying for, with egress on top;

      [Johannes Elmsheuser] 11:54:13
      you have to pay, compared to HPC. There's no discussion there.

      [Johannes Elmsheuser] 11:54:17
      But it's also not humongous when you compare everything and fold everything in, right? I just wanted to make that statement, and I think we can discuss this in more detail later in the dedicated cloud session.
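
      [Note: an illustrative aside. The point above can be shown with two assumed list prices (~$0.03 per core-hour, ~$0.09 per GB egress); neither is the project's actual rate.]

          # Sketch: egress as a fraction of cost for a simulation-type task.
          core_hours = 1_000 * 24          # 1000 cores for a day (assumed)
          cpu_cost = core_hours * 0.03     # -> $720
          output_gb = 500                  # simulation output shipped (assumed)
          egress_cost = output_gb * 0.09   # -> $45

          print(f"egress share: {egress_cost / (cpu_cost + egress_cost):.1%}")
          # ~6% here; unstructured analysis that re-reads and re-exports data
          # many times inverts the picture, as in the surprise bill below.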

      [Ian Fisk] 11:54:29
      I would claim that it's not humongous as long as you're in a very structured environment

      [Ian Fisk] 11:54:35
      and you are acting in a predictable way with how the data will be used for analysis. Like, at least for us:

      [Johannes Elmsheuser] 11:54:38
      Yeah.

      [Ian Fisk] 11:54:41
      we had a user browse some data in a way we weren't expecting, and they ran up a $75,000 export bill in a month.

      [Johannes Elmsheuser] 11:54:50
      Sure. I mean, that is then about how you structure your workflows. Absolutely, I fully agree.

      [Johannes Elmsheuser] 11:54:57
      So if you have an agreed workflow there, and here we are showing production, that's totally clear, right? And you don't want to have the surprises from some unstructured user analysis. Fully agreed.

      [Enrico Fermi Institute] 11:55:14
      Is there a comment from Paolo?

      [Paolo Calafiura (he)] 11:55:16
      Yes. I mean, I feel I'm becoming like a broken record,

      [Paolo Calafiura (he)] 11:55:25
      but once again, I think this slide shows you the benefits of committing versus taking, you know, an arm's-length approach.

      [Paolo Calafiura (he)] 11:55:34
      So we have always said that the cloud is a great way to, you know, do

      [Paolo Calafiura (he)] 11:55:39
      elastic computing, like the slide at the bottom kind of suggests: you know, when we need something for doing analysis, we will use it,

      [Enrico Fermi Institute] 11:55:39
      Hmm.

      [Paolo Calafiura (he)] 11:55:49
      and then our loads will be elastic. And that's what's expensive.

      [Paolo Calafiura (he)] 11:55:55
      But of course, once again, take the point of view of the vendor. They want to lock you in, and not necessarily with some evil mechanism, but just by offering you a good subscription deal, so that you take some of the money that

      [Enrico Fermi Institute] 11:55:55
      Okay.

      [Enrico Fermi Institute] 11:56:11
      Yeah.

      [Paolo Calafiura (he)] 11:56:13
      you otherwise would spend on your own hardware and give it to them. And so there is a lock-in

      [Paolo Calafiura (he)] 11:56:19
      there, because, of course, the price is constant for 12 months or 15 months, but it can change from one year to the next, and it will, as it should. So you are locked in, because then you don't have anymore, let's say, all of your Tier-1 or

      [Paolo Calafiura (he)] 11:56:38
      Tier-2 hardware, and then you are locked in with them.

      [Enrico Fermi Institute] 11:56:44
      Kaushik?

      [Kaushik De] 11:56:47
      Yeah, coming back to the other point. I'm sure it will be discussed tomorrow

      [Kaushik De] 11:56:55
      during the dedicated session also, but since it came up: the issue of heterogeneity in the cloud. The heterogeneity is actually extremely useful and extremely good.

      [Kaushik De] 11:57:09
      In the cloud we are using both Amazon and Google for studies with FPGAs, with ARM, with GPUs, and there is no effort in setting up those resources, because they're already available in the cloud. So I think there's the usefulness of highly specialized

      [Kaushik De] 11:57:41
      hardware at minimal cost, because we don't pay for setting it up in the cloud.

      [Kaushik De] 11:57:47
      It's already there; we can go in and use it, and that is an enormous resource for the experiments. Because, I mean, if we had to set up our own FPGA farm or ARM farm or GPU farm in order to do some of these studies,

      [Kaushik De] 11:58:03
      it would be prohibitively expensive.

      [Ian Fisk] 11:58:07
      Right, and I didn't mean to imply that there isn't real value in the diversity of resources on the cloud.

      [Ian Fisk] 11:58:14
      I was only commenting that at production scales it can become very expensive.

      [Enrico Fermi Institute] 11:58:25
A comment from Fernando?

      [Fernando Harald Barreiro Megino] 11:58:27
Yeah, it's a quick question, again about the egress cost.

      [Fernando Harald Barreiro Megino] 11:58:34
So there is always this legend that if there is a peering between, let's say, a cloud provider and,

      [Fernando Harald Barreiro Megino] 11:58:43
for example, ESnet, you can bring down the egress cost. I wanted to ask if that's really true, or just something that we heard

      [Fernando Harald Barreiro Megino] 11:58:55
but no one really knows about.

      [Enrico Fermi Institute] 11:59:01
Okay. I think we're definitely gonna have some dedicated time to talk about that on Wednesday.

      [Enrico Fermi Institute] 11:59:07
I know Dale is gonna have a slide or two for us, so maybe we move that question to Wednesday specifically, unless somebody wants to jump in right now.

      [Fernando Harald Barreiro Megino] 11:59:17
      okay.

      [Enrico Fermi Institute] 11:59:21
A comment from Alexei?

      [Alexei Klimentov] 11:59:22
Okay, so my comment is related to the comments from Ian and Paolo; they were different comments.

      [Alexei Klimentov] 11:59:30
So I may disagree, but we use clouds and HPCs in completely different ways.

      [Alexei Klimentov] 11:59:40
The whole idea of trying clouds is what was written on this slide: that we can elastically scale resources,

      [Alexei Klimentov] 11:59:48
so we can have this diversity of resources, and we can build our own architecture, at least.

      [Enrico Fermi Institute] 11:59:50
      Yeah.

      [Enrico Fermi Institute] 11:59:53
      Excellent.

      [Alexei Klimentov] 11:59:57
But with the grid, what we have, and especially with the LCFs,

      [Alexei Klimentov] 12:00:02
then you have boundary conditions: this machine was built, as was mentioned correctly, not for HEP

      [Alexei Klimentov] 12:00:09
but for other domains. And, for Paolo, my colleague: with the cloud,

      [Alexei Klimentov] 12:00:19
what we have is many years of experience overall. I don't think it is the right way to mirror

      [Alexei Klimentov] 12:00:26
our suspicions about commercial companies onto what we are doing with clouds right now. Certainly they want to make money,

      [Alexei Klimentov] 12:00:34
but we're not so stupid, we are not so stupid as to stop our Tier

      [Alexei Klimentov] 12:00:38
2s and to use just clouds. And the whole idea of the 15-month project with Google is just to learn it better.

      [Alexei Klimentov] 12:00:47
So I think we are at a very early stage with clouds, in understanding the cost model and how it can be integrated with our grid model.

      [Enrico Fermi Institute] 12:00:50
      Okay.

      [Alexei Klimentov] 12:00:59
Those were my two comments.

      [Enrico Fermi Institute] 12:01:05
So, it is 11. We're going to have a short presentation

      [Enrico Fermi Institute] 12:01:11
from Wahid. Wahid, are you out there?

      [Wahid Bhimji] 12:01:13
      Yes. Hello!

      [Wahid Bhimji] 12:01:18
Yeah, hold on. I'll just change rooms right now.

      [Enrico Fermi Institute] 12:01:18
      yeah, okay.

      [Wahid Bhimji] 12:01:22
      I'm just gonna move into a meeting room

      [Enrico Fermi Institute] 12:01:25
So you had your workshop. Are you discussing NERSC-11 now?

      [Wahid Bhimji] 12:01:29
Just 10. Yeah, we're not quite that far ahead.

      [Enrico Fermi Institute] 12:01:30
Oh, that's 10. I got the number wrong.

      [Wahid Bhimji] 12:01:36
Yeah, so it's good timing to have this conversation,

      [Wahid Bhimji] 12:01:39
actually. So yeah, I have a few slides.

      [Wahid Bhimji] 12:01:46
I don't necessarily need to talk to them.

      [Wahid Bhimji] 12:01:50
      I wasn't sure if you wanted slides or not.

      [Enrico Fermi Institute] 12:01:56
Do you want to share? Can we allow sharing, or

      [Enrico Fermi Institute] 12:01:59
are you allowed to share?

      [Wahid Bhimji] 12:02:02
Yeah, I think so. Well... hang on.

      [Enrico Fermi Institute] 12:02:03
      Oh!

      [Wahid Bhimji] 12:02:05
      I'm just

      [Enrico Fermi Institute] 12:02:06
      Great

      [Wahid Bhimji] 12:02:09
Let's see, does that work? You see some window? Yeah, let's see if slide show mode messes it up.

      [Enrico Fermi Institute] 12:02:17
      Yes, yes.

      [Enrico Fermi Institute] 12:02:20
      Right.

      [Wahid Bhimji] 12:02:23
So, I mean, this is actually just based on some slides I showed at the CCE meeting, or Debbie showed them.

      [Wahid Bhimji] 12:02:29
So there's no particular news here, but just to share, just to set the context.

      [Enrico Fermi Institute] 12:02:34
      Thank you.

      [Wahid Bhimji] 12:02:36
      And then we can just talk about, you know, whatever you want to talk about.

      [Wahid Bhimji] 12:02:38
I guess so. This is the current state of the NERSC systems.

      [Wahid Bhimji] 12:02:46
So this shouldn't actually say Phase 1; now we have the full Perlmutter, both the A100-accelerated GPU nodes and CPU-only nodes.

      [Wahid Bhimji] 12:02:59
But this is still not quite in production, as was mentioned briefly earlier. We do have some file system problems in the last stage of upgrading them to use this new Slingshot high-speed interconnect; there have been a few snags, I guess, but

      [Wahid Bhimji] 12:03:15
those are being resolved, and I'd say it's probably within a month of being at the point of fully available in a production

      [Wahid Bhimji] 12:03:25
kind of mode. And, as you probably know, so far it's been in an early-science, kind of free, mode, where you don't have to use your allocation in order to use

      [Wahid Bhimji] 12:03:35
it. But that's coming soon. And then we still have Cori in production, and that is the main production machine at the moment, and the goal is to retire that at the start of next year, pending Perlmutter actually being fully in production.

      [Wahid Bhimji] 12:03:58
And so, yeah, there's just a comment here saying that we do, you know,

      [Wahid Bhimji] 12:04:06
look at what user requirements are. While, in order to get increased computing resources, it was necessary to move to accelerated nodes, as the only way we could offer the kind of increase in performance we need from this machine over the previous machine, we do recognize that many communities are not ready to use GPUs for all

      [Wahid Bhimji] 12:04:26
of their workload. And so that's why there are

      [Wahid Bhimji] 12:04:29
CPU-only nodes that actually provide all of the capability of Cori

      [Wahid Bhimji] 12:04:35
in these nodes. Okay, yeah, so that's the system.

      [Wahid Bhimji] 12:04:41
      This is a bit more of kind of where we're going.

      [Wahid Bhimji] 12:04:43
We're only gonna have Perlmutter. So there's a bit more detail on the CPU nodes here. And then, just to say, it was on the previous slide as well,

      [Wahid Bhimji] 12:04:51
but these file systems are being made available. And also, we do put a kind of focus on having connections with external facilities, including other HPC centers as well as, you know, science user facilities.

      [Wahid Bhimji] 12:05:08
Okay. And then, you know, we've shown this many times: we had this Superfacility project, and this was really about trying to improve the engagement with data-intensive workloads that also need workflow services running alongside. So we have an infrastructure that's

      [Wahid Bhimji] 12:05:23
Kubernetes-based for services on the side.

      [Wahid Bhimji] 12:05:27
You know, we put focus on things like Jupyter notebooks that can also run on the big machines, and we're really pushing for federated identity.

      [Wahid Bhimji] 12:05:35
I mean, that's kind of rolled out now, so that you can use credentials from other places to access NERSC,

      [Wahid Bhimji] 12:05:43
assuming you have a NERSC account, so you kind of pair the two. And hopefully that will be pushed out further, and that's coming

      [Wahid Bhimji] 12:05:50
in the months ahead, as part of this Integrated Research Infrastructure task force, which is trying to really get cooperation across different centers for these

      [Wahid Bhimji] 12:06:04
things. So that's just an example with, you know, a

      [Enrico Fermi Institute] 12:06:07
      Please.

      [Wahid Bhimji] 12:06:07
HEP-type workflow, LZ. You know, we are the primary center for them, and the only center in the US,

      [Wahid Bhimji] 12:06:16
so they really have to have all aspects of their workflow working well at NERSC, and it takes a lot of engagement to achieve that.

      [Wahid Bhimji] 12:06:26
I guess this is saying: okay, so we engage with scientists in lots of ways.

      [Wahid Bhimji] 12:06:32
So there's the NESAP program, and, you know, ATLAS and

      [Wahid Bhimji] 12:06:35
CMS are both part of that. That can help provide resources to help port to new architectures, and also to explore AI methods, which is really also a way of using GPU resources, as well

      [Wahid Bhimji] 12:06:51
as having the same benefits in terms of transformative change to the way science works. And then we also have the Superfacility project that is trying to build more workflow stuff. So, in the future: NERSC-10, which I'm just mentioning; we have a workshop about it now, internally, that

      [Wahid Bhimji] 12:07:08
we're working on. It has achieved CD-0,

      [Wahid Bhimji] 12:07:10
so that means there's a mission need for it. Then we're now really putting together an RFP,

      [Wahid Bhimji] 12:07:15
which will go out to vendors to bid to provide us with the machine.

      [Wahid Bhimji] 12:07:21
So that's the stage it's at. And part of the way this has been phrased,

      [Wahid Bhimji] 12:07:25
the mission need, is that we need a machine to support workflows rather than just applications.

      [Wahid Bhimji] 12:07:32
So I think that helps the experimental HEP community as well. And then I briefly mentioned this thing, the Integrated Research Infrastructure effort.

      [Wahid Bhimji] 12:07:41
That is another DOE-wide effort to build workflow technologies and support different centers.

      [Wahid Bhimji] 12:07:51
I guess this is just the NERSC-10 mission statement here.

      [Wahid Bhimji] 12:07:56
Probably there's nothing new for you there, and this is just saying again that we expect this machine to really stretch out into ESnet and other places, and provide, you know, a way people can run stuff using data from outside.

      [Wahid Bhimji] 12:08:13
Then I just briefly wanted to mention the... Yes, sure.

      [Enrico Fermi Institute] 12:08:16
A quick question on that slide. So that means essentially streaming,

      [Enrico Fermi Institute] 12:08:23
then, also streaming in and streaming out?

      [Wahid Bhimji] 12:08:25
Yes. So that comment was made earlier, and there are various use cases, not just HEP, who want to do that, including the light sources, like you mentioned.

      [Wahid Bhimji] 12:08:37
So we do anticipate supporting that better. In principle,

      [Wahid Bhimji] 12:08:42
it should already be much better on Perlmutter than it was on Cori.

      [Wahid Bhimji] 12:08:44
I mean, yeah, not to mention the problems we've had on Cori, which really never got properly resolved. Perlmutter already

      [Enrico Fermi Institute] 12:08:46
      Okay.

      [Wahid Bhimji] 12:08:53
should have better capabilities to do this.

      [Enrico Fermi Institute] 12:08:59
Okay, great. Thanks.

      [Wahid Bhimji] 12:09:02
Okay, this is just a couple of slides of context as well. You know, the landscape as a whole is getting increasingly challenging with heterogeneity. In some ways there may be advances: this is Grace, the NVIDIA Grace Hopper architecture, which has CPUs and

      [Wahid Bhimji] 12:09:20
GPUs with, you know, shared access to memory between them.

      [Enrico Fermi Institute] 12:09:22
      Yeah.

      [Wahid Bhimji] 12:09:26
So in some sense this could reduce data movement costs, and so make this easier to program than current architectures.

      [Wahid Bhimji] 12:09:36
But, on the other hand, Grace is an ARM CPU, so there are already, you know, some differences

      [Wahid Bhimji] 12:09:42
there. And then there's also this move to chiplets,

      [Wahid Bhimji] 12:09:48
AMD, for example, having all kinds of different cores on there;

      [Wahid Bhimji] 12:09:50
there's DPUs, so programming in the network.

      [Wahid Bhimji] 12:09:54
And then there are all these AI-hardware-specific architectures. And then, a bit longer term, there's the idea of processing in storage, and there's also a move we see on the NERSC-10 timeframe, just kind of coming in, towards disaggregation, which

      [Wahid Bhimji] 12:10:09
potentially allows more efficient use of resources. So this is the idea that you could have a disaggregated memory pool which gives you increased memory capacity, but not on the node. So you'd be incorporating memory from outside the node, but that means that people who need

      [Wahid Bhimji] 12:10:27
much higher memory capacity would actually be able to access that without us having to buy that in every single node. So there are opportunities here,

      [Wahid Bhimji] 12:10:37
but also quite a complex landscape. And then, you know, there's this rise of the cloud market that really is driving everything. So this is an opportunity, of course, because we can capitalize on all this investment going into cloud interfaces and so forth, but it means

      [Wahid Bhimji] 12:10:56
that we also have to recognize that in the kind of machines that we have access to, and we can expect that these interfaces will become the standard way of accessing machines.

      [Wahid Bhimji] 12:11:11
So this is also good. I think it means that if you use these cloud interfaces, then there's, you know, probably a good expectation that these should be what we use;

      [Wahid Bhimji] 12:11:25
we should definitely work with the other compute centers to make sure these are well supported at the various compute centers.

      [Wahid Bhimji] 12:11:34
And this is just one slide on... I mean, since this was the HPC

      [Wahid Bhimji] 12:11:38
and Cloud workshop, I just thought about this. You know,

      [Enrico Fermi Institute] 12:11:39
      Yes.

      [Wahid Bhimji] 12:11:41
we're already kind of using that in there, as mentioned, in the Spin services that sit on the side.

      [Wahid Bhimji] 12:11:46
      But we're increasingly seeing a tighter integration to the main system.

      [Wahid Bhimji] 12:11:51
And so I expect on NERSC-10 there'll be an increasing ability to use cloud-type interfaces to access the big supercomputing resources as well.

      [Enrico Fermi Institute] 12:12:04
      Okay.

      [Wahid Bhimji] 12:12:04
Now it's good, at least. Okay. So I think that's all I really had.

      [Wahid Bhimji] 12:12:10
This one is just about data management also; I think we see that having an increased role here in the NERSC-10 timeframe, which I think should also help this community. But again, and this is probably a general point I thought of as the discussion was going on earlier, we do have to cater for a very

      [Wahid Bhimji] 12:12:29
wide community. So that's maybe one of the disadvantages we have compared to the leadership computing facilities: we do try to support different user communities,

      [Wahid Bhimji] 12:12:39
but we have, you know, thousands of users and several hundred projects that have different needs.

      [Wahid Bhimji] 12:12:46
Some of them are traditional HPC projects, so they need, you know, tightly coupled, large-scale resources;

      [Wahid Bhimji] 12:12:54
some are more similar to experimental HEP, but have their own,

      [Wahid Bhimji] 12:13:00
you know, their own ways of doing things that are a little bit different from how experimental HEP is doing it. And so we have to come to some sort of balance of supporting all of these.

      [Wahid Bhimji] 12:13:12
Okay, I think that's me. Yeah, any questions?

      [Enrico Fermi Institute] 12:13:18
Thanks, Wahid. I have one question. I think you mentioned that NERSC-10 is going to have a lot of, you know, accelerators, for performance and things like that.

      [Enrico Fermi Institute] 12:13:29
Do you guys have any feeling for what the mix will be,

      [Enrico Fermi Institute] 12:13:34
of accelerators and CPUs, in the next machine?

      [Wahid Bhimji] 12:13:39
Oh, well, we don't, and we're having that discussion. So, one thing (and there are other things that might come into play here as well): I mean, I think you can guarantee that there will be some GPUs in this machine, pretty

      [Enrico Fermi Institute] 12:13:41
      Okay.

      [Enrico Fermi Institute] 12:13:41
      Yeah.

      [Wahid Bhimji] 12:13:54
much. Realistically, that will be the most likely, you know, generally usable accelerator

      [Wahid Bhimji] 12:13:59
there is today. Then, I mentioned there are these disaggregation technologies, and also several of the vendors are talking about multi-tenancy and so forth.

      [Wahid Bhimji] 12:14:10
So it is possible that, you know, one could run the CPU-only workload alongside, without any dedicated CPU-only nodes. That would be a judgment on whether that technology really allows that, and whether it would provide sufficient resources,

      [Wahid Bhimji] 12:14:31
so that those codes that are super GPU-heavy and accelerated would leave enough of the CPU to allow other jobs, that are CPU-only, to run on there. But anyway, it is certain that part of the community, even on the 2026 timescale, won't be ready

      [Wahid Bhimji] 12:14:50
for accelerated-only, so, you know, there will continue to be some CPU resource. And then, on the more exotic accelerators, I think it is likely that we will, in the RFP, have some place where people can pitch AI

      [Enrico Fermi Institute] 12:14:52
      Okay.

      [Wahid Bhimji] 12:15:10
accelerators, for example. You know, whether those are offering a significant benefit above GPUs, I don't think is yet clear

      [Wahid Bhimji] 12:15:20
at the minute. I don't think they particularly are now, but they may do

      [Wahid Bhimji] 12:15:24
on the 2026 timescale. But the AI workload, you know, is currently not a very big fraction of what we're running, and so it would have to be sized accordingly. And I would say, on

      [Wahid Bhimji] 12:15:38
the integration with cloud, we're also looking at that. As the point was made earlier, there's a huge variety of technology on the cloud, and even though we try to deploy cutting-edge technology, you know, obviously they're quicker to deploy various new technologies. So it

      [Wahid Bhimji] 12:15:53
may be that we can, you know, partner with cloud providers to provide some of this capability for experiments, and for particular workloads that need to run on different accelerators.

      [Enrico Fermi Institute] 12:16:11
Wahid, would it be fair to say that we shouldn't expect a significant scale-up of the CPU? Because if I look at Cori to Perlmutter, the CPU basically stayed pretty much flat, more or less, because the CPU fraction of Perlmutter is somewhat equivalent in performance

      [Wahid Bhimji] 12:16:21
      Hmm.

      [Enrico Fermi Institute] 12:16:30
to what we had on Cori. And just because of power budget reasons, I wouldn't expect that NERSC-10 gives us 3 times the CPU.

      [Enrico Fermi Institute] 12:16:40
That's not a problem, I don't know. Probably... yeah, okay.

      [Wahid Bhimji] 12:16:41
      right.

      [Wahid Bhimji] 12:16:42
Right, yeah. I think, in terms of CPU-only resources,

      [Wahid Bhimji] 12:16:47
that would be a reasonable expectation.

      [Enrico Fermi Institute] 12:16:50
      Good.

      [Enrico Fermi Institute] 12:16:58
Other questions for Wahid? A more short-term technical one:

      [Enrico Fermi Institute] 12:17:06
so for data transfer out, Globus is not the be-all and end-

      [Enrico Fermi Institute] 12:17:10
all for the LHC. I know that there was some work to do something with XRootD.

      [Wahid Bhimji] 12:17:17
Yeah, so that's still ongoing. I mean...

      [Enrico Fermi Institute] 12:17:20
How's that going?

      [Wahid Bhimji] 12:17:23
Well, I mean, yeah, we're still working on it, right?

      [Wahid Bhimji] 12:17:28
I mean, it's got a bit slower now, but I think we are trying to do that. And I think, particularly if both ATLAS and CMS can use the same interface, and also other,

      [Wahid Bhimji] 12:17:39
you know, HEP experiments, and even potentially the light sources,

      [Wahid Bhimji] 12:17:42
then it's something worth us putting effort into supporting. I also think we need to...

      [Wahid Bhimji] 12:17:49
So, at the moment Spin, these kind of containerized services, hasn't been optimized for running data management services, but I think that's another thing that we should be able to support in the longer run, and that will allow people to run all kinds of different things on that side. I mean, Globus

      [Wahid Bhimji] 12:18:08
is for us the best, you know; it's supported by the largest number of other communities, so it's really worth us putting effort into supporting it.

      [Wahid Bhimji] 12:18:20
But yeah, I do appreciate that not everyone uses it,

      [Wahid Bhimji] 12:18:23
and so we do need other things. I did have a brief chat...

      [Wahid Bhimji] 12:18:26
I saw Ian Foster actually at a conference a couple of weeks ago, and so I did have a brief chat with him,

      [Wahid Bhimji] 12:18:34
anyway, about ways we can maybe improve Globus and XRootD kind of interoperation,

      [Wahid Bhimji] 12:18:43
but that was no more than a chat at this point. He seemed open to more discussions on that front.

      [Enrico Fermi Institute] 12:18:52
And it's probably not for this talk; we can chat later on that. We were, like,

      [Enrico Fermi Institute] 12:18:59
technically, we were stuck on third-party-copy things. But yeah, we can talk it over

      [Enrico Fermi Institute] 12:19:07
when there's time. Okay, other questions, from anybody else?

      [Enrico Fermi Institute] 12:19:16
Anybody on Zoom?

      [Enrico Fermi Institute] 12:19:24
By the way, just so you know, we had this planned for the afternoon, for the HPC focus area, but due to the ongoing workshop there was a little bit of a scheduling conflict here.

      [Enrico Fermi Institute] 12:19:35
So we... okay, alright.

      [Wahid Bhimji] 12:19:35
Yeah, so I won't be around in the afternoon. So if you wanna attack me, you should do it now. But yeah, we'd be interested in also seeing the blueprint

      [Enrico Fermi Institute] 12:19:42
      considering.

      [Wahid Bhimji] 12:19:45
as well, once you have it, or whatever, because I think that will help, you know, as was mentioned.

      [Enrico Fermi Institute] 12:19:49
I'm not sure at what level. It's probably not gonna be fully public, but there might be a version of it that's going to be public.

      [Wahid Bhimji] 12:19:57
Right, yeah. I mean, again, for influencing kind of architectural decisions,

      [Enrico Fermi Institute] 12:19:57
      We'll have to see what

      [Wahid Bhimji] 12:20:02
I mean, it's really when we're evaluating the RFP

      [Wahid Bhimji] 12:20:05
and stuff that we can bring in these considerations.

      [Enrico Fermi Institute] 12:20:09
So, are you looking at things like the very low power cores, like ARM?

      [Wahid Bhimji] 12:20:14
Yeah, I mean, you know, NVIDIA want to sell you this Grace Hopper architecture now.

      [Wahid Bhimji] 12:20:22
So they're selling ARM CPUs with the GPU, so at least with the GPU...

      [Wahid Bhimji] 12:20:28
So, at least for the GPU-accelerated nodes,

      [Wahid Bhimji] 12:20:31
if they're NVIDIA, then they would be ARM. And they also sell CPU-only, or will do so.

      [Enrico Fermi Institute] 12:20:39
      That's which

      [Wahid Bhimji] 12:20:43
But, you know, again, it depends on the workload. And for the CPU-only nodes, I mean, given the communities involved, you know, many of them are not that flexible,

      [Wahid Bhimji] 12:20:53
so it may be that that doesn't really make sense, for CPU-

      [Wahid Bhimji] 12:20:57
only nodes, to have ARM.

      [Enrico Fermi Institute] 12:21:04
Okay. Anything else?

      [Enrico Fermi Institute] 12:21:08
Okay. Wahid, thank you so much for attending; appreciate the presentation.

      [Wahid Bhimji] 12:21:10
Thanks, everyone.

      [Enrico Fermi Institute] 12:21:19
So, slides again; I need to share them.

      [Enrico Fermi Institute] 12:21:32
Do you want to do these? Or, here, I can go through them.

      [Enrico Fermi Institute] 12:21:37
So, one of the other questions on the charge was what metrics should be used to decide whether a workflow is executed efficiently, both in how we acquire the resources and also then how we operate the workflows. Is it efficient to get a certain resource, basically spending the effort to get

      [Enrico Fermi Institute] 12:22:01
it (that's a cost), and then actually to run our workflows on it? And acquiring in this context means two things. One is to actually get access to the resources, which on HPC and cloud means the proposals, the HPC competitive proposals, where you put

      [Enrico Fermi Institute] 12:22:22
them in. Usually, at the moment, these are yearly proposals.

      [Enrico Fermi Institute] 12:22:27
You have to follow a certain procedure, and every HPC

      [Enrico Fermi Institute] 12:22:29
facility is different. XSEDE/ACCESS is like an umbrella organization where you can ask for time on multiple facilities

      [Enrico Fermi Institute] 12:22:37
in one proposal, but others are unique to

      [Enrico Fermi Institute] 12:22:43
one facility. And on cloud, either you just pay as you go, pay whatever the list price is, on demand or spot or preemptible,

      [Enrico Fermi Institute] 12:22:53
or whatever the instance type is called, but it's publicly available

      [Enrico Fermi Institute] 12:22:55
pricing to everyone: if you show up with a credit card, you can get it. Or, like what ATLAS is doing right now,

      [Enrico Fermi Institute] 12:23:04
a subscription based on a negotiation: basically, we commit to a certain amount of money, and you get a certain block of resources, with limitations and rules on how you can use them. And the second part of the acquiring is the actual provisioning: once someone gives you

      [Enrico Fermi Institute] 12:23:23
the key, basically says here are the resources, you actually have to figure out

      [Enrico Fermi Institute] 12:23:27
how you actually tie them into our systems so that you can make use of them.

      [Enrico Fermi Institute] 12:23:34
So at the HPC level it's things like batch queues, the unit of provisioning being a number of nodes, scheduler policies; all of that comes into play, because it's all different from what we are used to on our own resources

      [Enrico Fermi Institute] 12:23:51
that we own, where we have a fixed quota. We say you get 4,000 cores;

      [Enrico Fermi Institute] 12:23:55
okay, there might be a 24-hour wait while capacity comes back from other people,

      [Enrico Fermi Institute] 12:24:01
but eventually, if you provide a stable, basically a sufficient, amount of work, it will always give you 4,000 cores.

      [Enrico Fermi Institute] 12:24:09
That's different on the HPC; you don't have any guarantees there.
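
      As a minimal illustration of that provisioning unit being whole nodes for a bounded wall time, rather than a steady per-core quota, a sketch like the following submits a block of nodes through Slurm. The account, constraint, and pilot script names are hypothetical; only the sbatch flags themselves are real.

      ```python
      #!/usr/bin/env python3
      """Sketch: provisioning whole nodes on an HPC batch system via Slurm."""
      import subprocess

      def submit_pilot(nodes: int, walltime: str, account: str = "m0000") -> str:
          """Submit a multi-node pilot job with sbatch; returns the job ID."""
          batch_script = f"""#!/bin/bash
      #SBATCH --nodes={nodes}
      #SBATCH --time={walltime}
      #SBATCH --account={account}
      #SBATCH --constraint=cpu
      # One pilot per node; each pilot pulls experiment payloads to fill the node.
      srun --ntasks-per-node=1 ./pilot.sh
      """
          result = subprocess.run(
              ["sbatch", "--parsable"],          # sbatch reads the script on stdin
              input=batch_script, capture_output=True, text=True, check=True,
          )
          return result.stdout.strip()           # --parsable prints just the job ID

      if __name__ == "__main__":
          print("Submitted job", submit_pilot(nodes=1000, walltime="12:00:00"))
      ```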

      [Enrico Fermi Institute] 12:24:14
And then cloud; cloud is less problematic in terms of provisioning, because you pay your money. But, depending on what pricing model and what rules you follow,

      [Enrico Fermi Institute] 12:24:26
you can still have to deal with contention in

      [Enrico Fermi Institute] 12:24:29
certain regions, which depends on the size of the region,

      [Enrico Fermi Institute] 12:24:35
the activity of other customers, which instance types you request, and so on.

      [Enrico Fermi Institute] 12:24:40
Yes. And then, once you have the resources and they are available, and you provision them and they're integrated,

      [Enrico Fermi Institute] 12:24:49
then you look at what metrics are interesting to determine whether you actually operate efficiently.

      [Enrico Fermi Institute] 12:24:56
The standard one we use everywhere: CPU efficiency. GPU efficiency?

      [Enrico Fermi Institute] 12:25:05
There's basically nothing; it's an open question. We don't have anything that measures how efficiently we use the GPU.

      [Enrico Fermi Institute] 12:25:14
On the cloud, what it eventually comes down to is the dollars per event, or the dollars paid per HS06-hour

      [Enrico Fermi Institute] 12:25:21
you get. HPC: there's no direct monetary

      [Enrico Fermi Institute] 12:25:27
cost associated, so the outlay is zero in monetary terms, but of course it's not free in effort.
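
      As a back-of-the-envelope sketch of those two cloud-side metrics, with every price and rate made up purely for illustration:

      ```python
      # Sketch: dollars per event and dollars per HS06-hour delivered.
      # All numbers below are illustrative, not vendor quotes.

      def dollars_per_event(hourly_rate_usd: float, cores: int,
                            events_per_core_hour: float) -> float:
          """Cost of one processed event on a given instance type."""
          return hourly_rate_usd / (cores * events_per_core_hour)

      def dollars_per_hs06_hour(hourly_rate_usd: float,
                                hs06_per_instance: float) -> float:
          """Cost of one HS06-hour of delivered compute."""
          return hourly_rate_usd / hs06_per_instance

      # Hypothetical 16-core instance: $0.68/h on demand, ~10 HS06 per core,
      # simulation running at 40 events per core-hour.
      print(f"{dollars_per_event(0.68, 16, 40):.5f} $/event")
      print(f"{dollars_per_hs06_hour(0.68, 16 * 10):.5f} $/HS06-hour")
      ```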

      [Enrico Fermi Institute] 12:25:36
Then you look at overall utilization. So, if you have a certain number of cloud credits, or if you have a certain allocation size, are you using these up? Because you spent some effort to get them, so you should use them. Like the subscription

      [Enrico Fermi Institute] 12:25:52
model: if Google lets you use 10,000 cores as part of the subscription, it doesn't make sense to only use 1,000.

      [Enrico Fermi Institute] 12:26:00
There is no benefit, only a penalty, for not using up your full quota.

      [Enrico Fermi Institute] 12:26:04
Yes. The other thing is turnaround time. By turnaround time I mean provisioning

      [Enrico Fermi Institute] 12:26:13
turnaround. This comes in especially on HPC, if you talk about the LCFs; if you talk about the unit of provisioning, what's associated with that is also

      [Enrico Fermi Institute] 12:26:26
the latency, which is very different from what we are used to with our normal grid operations.

      [Enrico Fermi Institute] 12:26:33
At an LCF, if you ask for 1,000 nodes, you can wait, and you have no idea when you're gonna get it.

      [Enrico Fermi Institute] 12:26:39
Eventually you'll get it; it's not under your control.

      [Enrico Fermi Institute] 12:26:44
And for all these metrics: how are we gathering them?

      [Enrico Fermi Institute] 12:26:48
I mean, for our own resources we have services in place, we have many years of operation, so we just have to

      [Enrico Fermi Institute] 12:26:56
get them to forward information, to collect it. HPC and

      [Enrico Fermi Institute] 12:27:00
cloud are different, especially HPC. On the cloud you can run whatever you want,

      [Enrico Fermi Institute] 12:27:04
but HPC is problematic: you need to collect statistics from the batch system, from the job system, and so on,

      [Enrico Fermi Institute] 12:27:13
and decide how you forward it, so it's actually collected in the right place and you can compare it.
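
      A minimal sketch of that collection step, assuming a Slurm-based facility: the sacct flags and column names are real Slurm accounting fields, while the account name and the forwarding step are illustrative. It also pulls out the queue wait, the provisioning-turnaround metric mentioned above.

      ```python
      # Sketch: pulling per-job statistics out of Slurm's accounting DB with sacct.
      import subprocess
      from datetime import datetime

      FMT = "JobID,Submit,Start,CPUTimeRAW"   # CPUTimeRAW = allocated core-seconds

      def job_stats(account: str, since: str):
          out = subprocess.run(
              ["sacct", "-A", account, "-S", since, "-X", "-n", "-P",
               f"--format={FMT}"],
              capture_output=True, text=True, check=True,
          ).stdout
          for line in out.strip().splitlines():
              jobid, submit, start, cpu_s = line.split("|")
              if start in ("Unknown", "None"):
                  continue  # still pending, no turnaround to report yet
              wait = (datetime.fromisoformat(start)
                      - datetime.fromisoformat(submit)).total_seconds()
              yield {
                  "jobid": jobid,
                  "queue_wait_h": wait / 3600,        # provisioning turnaround
                  "alloc_core_h": int(cpu_s) / 3600,  # what the facility charged
              }

      for rec in job_stats("m0000", "2022-10-01"):    # hypothetical account
          print(rec)  # in practice: forward to the central accounting service
      ```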

      [Enrico Fermi Institute] 12:27:19
Do we have questions? We got a question. Yep.

      [Enrico Fermi Institute] 12:27:24
      yes.

      [Ian Fisk] 12:27:25
Yeah, I had a question, which was about the concept of: you have nothing for GPU efficiency.

      [Ian Fisk] 12:27:31
If you have nothing for GPU efficiency, it's just that you haven't asked. The GPUs themselves monitor

      [Ian Fisk] 12:27:38
their utilization very well. The command is nvidia-smi; it will tell you how much of the memory and how much of the theoretical processing capacity you're using.
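
      A minimal sketch of the kind of probe being described, sampling utilization through nvidia-smi's query interface. The query fields and format flags are real nvidia-smi options; the sampling loop and where the samples go are illustrative.

      ```python
      # Sketch: sampling GPU and GPU-memory utilization via nvidia-smi.
      # Could run as a sidecar next to a payload, feeding job monitoring.
      import subprocess, time

      QUERY = "index,utilization.gpu,utilization.memory"

      def sample():
          out = subprocess.run(
              ["nvidia-smi", f"--query-gpu={QUERY}",
               "--format=csv,noheader,nounits"],
              capture_output=True, text=True, check=True,
          ).stdout
          for line in out.strip().splitlines():
              idx, gpu_util, mem_util = (x.strip() for x in line.split(","))
              yield int(idx), int(gpu_util), int(mem_util)

      while True:
          for idx, gpu, mem in sample():
              print(f"gpu{idx}: {gpu}% busy, {mem}% memory activity")
          time.sleep(60)  # one sample a minute is plenty for accounting
      ```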

      [Enrico Fermi Institute] 12:27:49
No, I'm not saying that there is nothing you can run on a GPU to tell you to what degree it's utilized.

      [Enrico Fermi Institute] 12:27:56
What I'm saying is, I don't think we have any tooling where, with our GPU workflows, we actually record this information and keep track of it.

      [Ian Fisk] 12:27:56
      Okay.

      [Ian Fisk] 12:28:06
Okay. But at this point, in the same way that you record the CPU efficiency with top, you should simply... there are tools that do exactly the same thing for GPU.

      [Enrico Fermi Institute] 12:28:18
      And

      [Enrico Fermi Institute] 12:28:18
Yes, it just needs to be put in place and put into the monitoring, and it's more an indicator of how early we are in terms of adoption of GPU workflows in the experiments

      [Ian Fisk] 12:28:19
      Okay.

      [Enrico Fermi Institute] 12:28:33
than it is an indicator of a lack of low-level tools.

      [Enrico Fermi Institute] 12:28:36
So all of the tools are there; it's just a matter of threading it all through.

      [Ian Fisk] 12:28:39
Yeah, it's just, like, I can tell you:

      [Ian Fisk] 12:28:43
I send a lot of email every week about people who are not using the GPUs, especially

      [Ian Fisk] 12:28:47
well. And so it's probably something that should go in early on in a monitoring system, because it...

      [Ian Fisk] 12:28:55
It's not, like, it's not...

      [Ian Fisk] 12:28:56
it's not hard to get.

      [Enrico Fermi Institute] 12:28:58
      One thing

      [Enrico Fermi Institute] 12:28:59
One thing that conceptually is not quite as mature is, you know, what these different numbers mean when

      [Enrico Fermi Institute] 12:29:07
you're comparing across sites.

      [Enrico Fermi Institute] 12:29:12
How do you compare a 1080 versus an A100? Or maybe, in that example, you just round the 1080 down to zero and the problem is solved.

      [Enrico Fermi Institute] 12:29:22
But trying to aggregate and cross-compare... could we?

      [Enrico Fermi Institute] 12:29:29
I mean, eventually one of the things you're asking here is: am I getting my money's worth?

      [Enrico Fermi Institute] 12:29:36
And that will be asked from many different directions, including from the site.

      [Enrico Fermi Institute] 12:29:41
If you wanted to start to do that, you start doing accounting, and these sorts of things aren't as settled as...

      [Ian Fisk] 12:29:45
      right, but I

      [Ian Fisk] 12:29:52
Right, but, like, one of the reasons why we have HS0-

      [Ian Fisk] 12:29:57
6 was that we had a variety of CPUs, weren't sure what the performance was going to be between them, and used this benchmark to figure out the relative capacity of each of the sites. And it's not intrinsically more difficult than that; it's just that there's a much

      [Ian Fisk] 12:30:12
wider variation in the performance of GPUs.

      [Enrico Fermi Institute] 12:30:17
We need an HS06 for GPUs, maybe.

      [Ian Fisk] 12:30:20
Maybe you need a GPU-SPEC06. But yeah.
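
      If such a benchmark existed, the bookkeeping on top of it would be straightforward; here is a sketch with entirely made-up per-model factors, since the whole point of the discussion is that no agreed GPU benchmark exists yet.

      ```python
      # Sketch: normalizing delivered GPU-hours by a per-model benchmark factor,
      # the GPU analogue of what HS06 does for CPUs. All factors are invented.
      GPU_BENCHMARK = {        # hypothetical "HS06-like" scores per card
          "GTX-1080": 1.0,
          "V100": 4.0,
          "A100": 10.0,
      }

      def normalized_hours(usage: dict[str, float]) -> float:
          """usage maps GPU model -> wall-clock GPU-hours delivered."""
          return sum(GPU_BENCHMARK[model] * hours
                     for model, hours in usage.items())

      site_a = {"A100": 5_000}
      site_b = {"GTX-1080": 30_000, "V100": 4_000}
      print(normalized_hours(site_a), normalized_hours(site_b))  # now comparable
      ```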

      [Enrico Fermi Institute] 12:30:25
Another thing is, once you collect this, it's not clear to me that we know... I mean,

      [Enrico Fermi Institute] 12:30:29
with CPU efficiency we kind of have an idea of what's bad. I mean,

      [Enrico Fermi Institute] 12:30:34
usually it's bad when we get dinged by the review boards that look at our CPU efficiency and tell us we make bad use of the resources. It's been unclear to me, on the GPU side, what is bad. I would say we don't have any clue

      [Enrico Fermi Institute] 12:30:50
on the CPU side, but we pretend we do. We have about the same amount of clue... we have even less, because from, you know, architecture generation to architecture generation

      [Ian Fisk] 12:30:59
      right.

      [Enrico Fermi Institute] 12:31:05
of GPUs, things are changing pretty wildly, and the performance basis is very, very different for, you know, a Turing-class chip versus... Okay, so the conclusion is we need to learn to pretend to know what we're doing. Exactly, we need to come up with sufficiently

      [Enrico Fermi Institute] 12:31:22
obfuscated language, so that we can sound like we know what we're talking

      [Enrico Fermi Institute] 12:31:25
about. The fact that it's 2022 and our review boards

      [Enrico Fermi Institute] 12:31:31
don't understand hyperthreading suggests that by 2040, maybe, we'll have... Okay.

      [Enrico Fermi Institute] 12:31:46
So, I guess... sorry, the one thing I wanted to say is that, in terms of even coming up with performance benchmarks,

      [Enrico Fermi Institute] 12:31:55
I don't know if it makes a lot of sense to compare how you're doing with a bunch of Turings versus how you're doing with a bunch of Amperes, or A100s, or whatever, just because the way those processors work is so different. Yeah, that was

      [Enrico Fermi Institute] 12:32:10
my point. Sorry, sorry, we agree. Great, that's awesome.

      [Enrico Fermi Institute] 12:32:15
Not to jump to the next slide, but particularly when it comes to acquisition, at some point you have to feed back to the powers that be how you spent the money, whether it was funny money or real money, and that starts to get into things like pledging and, you know, actually having

      [Enrico Fermi Institute] 12:32:35
these resources effectively, more effectively, acknowledged by the experiments.

      [Enrico Fermi Institute] 12:32:42
Are we gonna touch that third rail today? Or, like... there's an accounting and pledging discussion on Wednesday morning, and benchmarking is part of that. Because, yeah, as you said, the thing is, with HPC, as long as it's opportunistic,

      [Enrico Fermi Institute] 12:32:57
no one cares; it's free resources. When people actually start pledging it,

      [Enrico Fermi Institute] 12:33:02
then you get into comparing performance numbers, and are you meeting your pledge or not meeting your pledge, and then things like measuring these things correctly, or at least measuring them in a way that you come up with a defensible number. Okay. One interesting

      [Enrico Fermi Institute] 12:33:23
thing I observe on the WLCG side this month: especially, some sites in Europe are saying, you know, we can't keep

      [Enrico Fermi Institute] 12:33:33
the CPUs running over the winter, but we'd like to, you know, have the same number of hours delivered, which of course is not what we pledge on. So I think there's gonna be more interest in examining some alternate models where there wasn't interest

      [Enrico Fermi Institute] 12:33:51
before. But I think we really need to push

      [Enrico Fermi Institute] 12:33:56
on having things like HPCs quote-unquote count. Right now,

      [Enrico Fermi Institute] 12:34:03
the value delivered officially to the experiments is rounded to zero, even though we know from the resource graphs that lots has been delivered, months of work in terms of events. That's gonna break at some point, and the fact that some of the traditional WLCG

      [Enrico Fermi Institute] 12:34:25
sites are also hitting the brakes on the old pledge models might be as good an opening as you could ask for to turn it into an option. Yeah: for the 2021 CRSG accounting, we basically added up what we actually delivered in 2021;

      [Enrico Fermi Institute] 12:34:44
US HPC was slightly above Fermilab. Now, okay, the normalization

      [Enrico Fermi Institute] 12:34:49
factors have large error bars, so it was basically comparable.

      [Enrico Fermi Institute] 12:34:53
But again, right now, viewed from some angle, you're saying:

      [Enrico Fermi Institute] 12:35:01
we delivered as much as Fermilab did, but then the value was written down to zero, because none of it quote-unquote officially counts.

      [Enrico Fermi Institute] 12:35:09
And that's a problem, and the problem only gets bigger. Yeah, a question about the turnaround:

      [Enrico Fermi Institute] 12:35:16
so, on the turnaround time, some HPC centers allow you to make reservations, where you plan ahead.

      [Enrico Fermi Institute] 12:35:26
Does that change some of these metrics, then? And does it also

      [Enrico Fermi Institute] 12:35:35
simplify things operationally? I can just tell you the experience we had:

      [Enrico Fermi Institute] 12:35:44
we have not used reservations for CMS, and that's mainly for the reason that the type of workflow we're sending always works anywhere, so we don't really care when it runs.

      [Enrico Fermi Institute] 12:35:57
I know that some of the neutrino and other science experiments had a big specific production that they targeted at an HPC, that they had scheduled and planned ahead; it makes perfect sense to do a reservation in that scenario. For us, I don't see that it

      [Enrico Fermi Institute] 12:36:14
would help as much, because the turnaround time is not so much a problem in terms of

      [Enrico Fermi Institute] 12:36:24
basically not being able to plan work, because for most of our work we don't care if it runs this week or next week or a couple of weeks later.

      [Enrico Fermi Institute] 12:36:31
I mean, there is high-priority stuff, but we usually just submit it soon and then play with the prioritization.

      [Enrico Fermi Institute] 12:36:37
The turnaround time is actually more of an issue for us in our software stack, because the system is just not designed with, like, a two-week provisioning time; there are provisioning assumptions baked in there. So this is more a software

      [Enrico Fermi Institute] 12:36:53
problem than an actual work-planning problem.

      [Enrico Fermi Institute] 12:37:00
It's still a useful metric to have, because it does constrain you, right?

      [Enrico Fermi Institute] 12:37:05
So, I mean, if provisioning needs a week, you cannot put high-priority stuff

      [Enrico Fermi Institute] 12:37:08
there. But that's a relatively small limitation, because most of our work is not high priority. If that changes, that's a different issue.

      [Enrico Fermi Institute] 12:37:19
But most of the work is just: get it done, we come back a month later and check that everything is done.

      [Enrico Fermi Institute] 12:37:30
      Okay.

      [Enrico Fermi Institute] 12:37:34
Other comments or questions from Zoom? Yeah... okay. Are there any questions that we didn't ask that we should be asking?

      [Enrico Fermi Institute] 12:37:51
I mean, just to beat the dead horse again: I think the accounting and reporting of resources has to be a top-level item.

      [Enrico Fermi Institute] 12:38:07
That's got to be appropriately pursued. So, not a particularly interesting technical topic,

      [Enrico Fermi Institute] 12:38:18
but that's okay. I mean, we did talk a lot about this in the context of HPC. Were there any specific comments folks wanted to make about this on cloud?

      [Enrico Fermi Institute] 12:38:33
We'll save some of that for the discussion tomorrow.

      [simonecampana] 12:38:36
Sorry, can I ask a question? This is Simone. It's following up on what Brian said.

      [Enrico Fermi Institute] 12:38:38
      Yep.

      [simonecampana] 12:38:43
I think it would be interesting, in fact, if those resources that today are a bit special (they might not be in the future) could be accounted properly, which means basically being reported back

      [Enrico Fermi Institute] 12:38:45
      Okay.

      [simonecampana] 12:38:57
through the official accounting tools we use. You understand?

      [simonecampana] 12:39:02
So the question, which I didn't get from the discussion: is the problem technical?

      [simonecampana] 12:39:07
Is it well understood how to do it, and someone just has to do the work? Because the typical ways, for example, of integrating HPCs:

      [simonecampana] 12:39:16
if you use an engine like HEPCloud, then you can put some of the intelligence there, and report upstream

      [simonecampana] 12:39:24
your accounting records. But if you have something like a direct integration of the HPC

      [simonecampana] 12:39:29
with the workload management system of the experiment, like, for example, the way ATLAS is doing it, you don't have that gateway.

      [simonecampana] 12:39:39
You don't have that service; you need, in practice, PanDA, or the workload management system, whatever

      [simonecampana] 12:39:46
it is, to report upstream. So I think it's a good idea to look into that.

      [simonecampana] 12:39:51
Do you have a view of how to do it?

      [Enrico Fermi Institute] 12:40:01
In terms of the technical pieces, I'm not so worried, all right?

      [Enrico Fermi Institute] 12:40:04
We've done this several times across multiple generations of technology,

      [Enrico Fermi Institute] 12:40:09
so it's not like it's the first time we've had to do an integration like that in the last five years.

      [Enrico Fermi Institute] 12:40:15
Again, my worry is, if we come in and say, you know, Oak Ridge delivered 100 million CPU hours to ATLAS,

      [Enrico Fermi Institute] 12:40:29
you know, how does that get counted as part of a delivered resource to the experiment? How do we, formally... can I say this, then: does that help meet the US's commitments to the WLCG? Because right

      [Enrico Fermi Institute] 12:40:46
now it's a very different thing, saying that

      [Enrico Fermi Institute] 12:40:48
okay, this resource counts toward the WLCG.

      [Enrico Fermi Institute] 12:40:53
Making that count official... we haven't touched that in two decades, versus the technical mechanism to get an integer for reporting.

      [simonecampana] 12:40:59
      okay.

      [Enrico Fermi Institute] 12:41:04
Yeah, we seem to reinvent that every five years or so.
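
      As a sketch of just the technical half (the policy half being the hard part), this is the kind of normalized usage record a workload management system might push upstream itself when no gateway like HEPCloud sits in between. Every field name here is hypothetical, not an existing WLCG accounting schema.

      ```python
      # Sketch: summarizing HPC usage into a record reported upstream directly
      # by the workload management system. Field names are hypothetical.
      import json
      from datetime import date

      def usage_record(site: str, vo: str, day: date,
                       wall_core_hours: float, norm_factor: float) -> str:
          return json.dumps({
              "site": site,          # e.g. an HPC allocation, not a grid site
              "vo": vo,
              "date": day.isoformat(),
              "wall_core_hours": wall_core_hours,
              "normalized_hs06_hours": wall_core_hours * norm_factor,
          })

      print(usage_record("OLCF-ALLOCATION", "atlas", date(2022, 10, 3),
                         wall_core_hours=250_000, norm_factor=11.5))
      # In practice this would be POSTed to the central accounting collector.
      ```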

      [simonecampana] 12:41:08
No, I see. So it's basically a question of policy you are raising, which is a good one.

      [Enrico Fermi Institute] 12:41:11
Got it.

      [simonecampana] 12:41:14
It has to do a bit with what the experiment considers a pledged resource, and a lot of the experiments consider a pledged resource something that they can use to run any workflow in a transparent way. So I think, as long as one goes in this

      [simonecampana] 12:41:29
direction, you will get the buy-in from the experiments.

      [simonecampana] 12:41:33
Otherwise there might be some discussions to have.

      [Enrico Fermi Institute] 12:41:37
Oh, I think there has to be some discussion, because I don't think, for any... with an exception, maybe, of the cloud;

      [Enrico Fermi Institute] 12:41:46
and even then, Lancium is a good counterexample, where it probably can't run everything. So...

      [Enrico Fermi Institute] 12:41:52
But to say that, yeah, the experiment got a billion CPU hours (again, just making up numbers) on Oak Ridge, and it's worth nothing because we can't run everything on there, I think is pretty short-sighted. But it is a very important discussion, and, you know, what I find is policies that are

      [Enrico Fermi Institute] 12:42:14
older tend to be harder to update. The fact that we haven't really dug into this in 20 years means it's gonna take some effort to come to a place where everybody is happy and feels that their concerns are heard.

      [simonecampana] 12:42:34
Yeah, I see your point. I think part of the problem

      [simonecampana] 12:42:39
is that there is a large spectrum, right? There are HPCs that can be used for almost everything, which you can consider sort of a pledged resource;

      [simonecampana] 12:42:46
there are HPCs that can be used to run one generator.

      [Enrico Fermi Institute] 12:42:50
      Okay.

      [simonecampana] 12:42:51
It's a bit short-sighted to say that those are like any other facility.

      [simonecampana] 12:42:56
And because the spectrum is broad, it's difficult to...

      [Enrico Fermi Institute] 12:42:59
      Please.

      [simonecampana] 12:42:59
I agree with you, it's something that...

      [Enrico Fermi Institute] 12:43:08
Going back to the acquiring, specifically for HPC: so right now, you mentioned that for HPC,

      [Enrico Fermi Institute] 12:43:18
you know, there are a couple of different kinds of proposals, submission types, whether it's leadership-class or a user facility. These still require proposals, which are each adjudicated each year and so forth. And to me it seems like you can't, say... you

      [Enrico Fermi Institute] 12:43:39
can't tie that to any sort of pledge situation,

      [Enrico Fermi Institute] 12:43:43
if you've got a proposal that has to be approved by, you know, a bunch of outside scientists on a committee, right?

      [Enrico Fermi Institute] 12:43:51
There was something recently; I'm not sure on this,

      [Enrico Fermi Institute] 12:43:58
but something came up: Lis mentioned that there were high-level discussions within the agencies about better support for data sciences.

      [Enrico Fermi Institute] 12:44:08
That was one of the areas of discussion. She didn't say anything about

      [Enrico Fermi Institute] 12:44:12
it beyond that; at least there are discussions. I don't know to what extent that will go anywhere.

      [Enrico Fermi Institute] 12:44:19
The one thing, with this year's NERSC proposal:

      [Enrico Fermi Institute] 12:44:24
they've started asking, the last couple of years, about special needs and, like, a multi-year planning horizon.

      [Enrico Fermi Institute] 12:44:31
And it seems this year they kind of already know what they're going to give us for the next three years; that's almost how the feedback sounded. We still have to write a proposal, but they asked us to write a simple proposal. So on the NSF side, we're

      [Enrico Fermi Institute] 12:44:47
starting to see, not at, like, the biggest Frontera-type scales,

      [Enrico Fermi Institute] 12:44:53
NSF start to give allocations as part of the grant proposal.

      [Enrico Fermi Institute] 12:45:02
So if you win a proposal like USCMS Ops,

      [Enrico Fermi Institute] 12:45:05
it comes with the allocation, as opposed to having some other, you know,

      [Enrico Fermi Institute] 12:45:10
peer-review committee. They basically see it as double jeopardy that they could give you the money and then have somebody else make you unable to use it.

      [Enrico Fermi Institute] 12:45:24
Okay. So that's beginning to go into the system, but not at the US

      [Enrico Fermi Institute] 12:45:33
LHC Ops scales. So, you know, there's more than discussion;

      [Enrico Fermi Institute] 12:45:39
there are actually a couple of examples of doing this at modest scale, but not at the biggest ones.

      [Enrico Fermi Institute] 12:45:51
Not across the finish line, but, you know, it's starting to show up in solicitations and things like this.

      [Enrico Fermi Institute] 12:45:57
We have provided this feedback to the funding agencies

      [Enrico Fermi Institute] 12:46:03
before; I think at the 2019 meeting the accounting was discussed there.

      [Enrico Fermi Institute] 12:46:10
It's difficult to write a generic, okay, you know, proposal for a collaboration that has some broad mix of workflows that is competitive against a specific,

      [Enrico Fermi Institute] 12:46:26
you know, scientific one. They get compared by scientific merit, right?

      [Enrico Fermi Institute] 12:46:31
And they're looking for specific outcomes, like: what did you discover on this machine,

      [Enrico Fermi Institute] 12:46:39
because we awarded you this allocation? And that's difficult

      [Enrico Fermi Institute] 12:46:47
if you're saying: okay, we ran, you know, just a generic mix of simulation for the experiment. At least on the NSF side,

      [Enrico Fermi Institute] 12:47:03
this is where they are looking to tie this to, yeah: you get the allocation as part of the US ATLAS or USCMS operations program. It's just not, you know... obviously it takes time for this proposal process, and it's still scaling up. I think this has

      [Enrico Fermi Institute] 12:47:27
to be addressed. I don't know if this is in this blueprint

      [Enrico Fermi Institute] 12:47:30
process; you know, whether we need to have much dedicated to this. But this is an issue. I mean, we

      [Enrico Fermi Institute] 12:47:42
spend a lot of time writing proposals, and they get reviewed by a committee.

      [Enrico Fermi Institute] 12:47:47
And yeah, the LCF proposals: basically, you have to dress them up.

      [Enrico Fermi Institute] 12:47:52
We actually tried that, because we did two proposals this year. One was GPU

      [Enrico Fermi Institute] 12:47:59
reconstruction on Summit, and that was approved, because it's something new, something we haven't done before. The other one we intentionally kept plain;

      [Enrico Fermi Institute] 12:48:12
we didn't dress it up. That was general Monte Carlo production on Theta, like:

      [Enrico Fermi Institute] 12:48:17
give us some resource increase, like 10% extra, just to run standard Monte Carlo production. That was rejected.

      [Enrico Fermi Institute] 12:48:25
And that's basically what I expected, because it's not exciting.

      [Enrico Fermi Institute] 12:48:32
It's something you can do everywhere; they look at it and say: why are you on the LCF?

      [Enrico Fermi Institute] 12:48:37
You can do this somewhere else. And that's the tension with the pledged allocation, where the pledge is supposed to be able to do everything. And again, I think that's why it has to be a major outcome

      [Enrico Fermi Institute] 12:48:49
of the report: we have to make the agencies realize that the experiments, the global collaborations, write down their contributions to zero dollars and zero cents, because they can't count on some of these,

      [Enrico Fermi Institute] 12:49:06
in a way that we can actually plan on.

      [Enrico Fermi Institute] 12:49:14
And part of it's gonna be a shift on the WLCG

      [Enrico Fermi Institute] 12:49:17
side, I think. But we also have to kind of throw some cold water on the agencies to make them

      [Enrico Fermi Institute] 12:49:25
wake up, sit up, and realize: oh, I'm not getting credit for the money. Because effectively they're putting in money

      [Enrico Fermi Institute] 12:49:32
and getting no credit for it, and they should be

      [Enrico Fermi Institute] 12:49:39
mad about it. But we've talked about this for five, six years now. Before lunch, let's move on to the last slide, on future workflows,

      [Enrico Fermi Institute] 12:49:53
just looking forward: will we need to restrict the kinds of workflows that we run on clouds and HPCs?

      [Enrico Fermi Institute] 12:50:00
Five years from now, will it make sense for us to partition our workflows?

      [Enrico Fermi Institute] 12:50:05
Will we be able to expect HPCs to just run all types of jobs? Will clouds be able to do that?

      [Enrico Fermi Institute] 12:50:11
It sounds like for clouds the answer is kind of yes already, but it remains to be seen

      [Enrico Fermi Institute] 12:50:16
whether the HPCs will be able to do that. What technologies, features, or policies are needed?

      [Enrico Fermi Institute] 12:50:25
Are there any capabilities provided by HPC or cloud that would allow us to run workflows that we can't run in other places?

      [Enrico Fermi Institute] 12:50:33
We started to sprinkle some ideas in here, but we will have some further discussion in the R&D section.

      [Enrico Fermi Institute] 12:50:40
Yeah, for the cloud it seems like we can basically run whatever we want,

      [Enrico Fermi Institute] 12:50:43
but we're limited by cost, and we can talk about that more in the cloud focus area.

      [Enrico Fermi Institute] 12:50:49
But yeah, open versus restricted: there's sort of a tension there, because it's obviously easier

      [Enrico Fermi Institute] 12:51:00
if you can run everything. But maybe we really should accept that some machines should get different workflows, because that's what they're designed for.

      [Enrico Fermi Institute] 12:51:15
It's a balance, though, because if you restrict it too much, it's uninteresting for the experiment, and you will never be able to pledge it.

      [Enrico Fermi Institute] 12:51:28
I mean, if you want to pledge it, it has to be something that can run the majority of what you're doing. Otherwise, I expect,

      [Enrico Fermi Institute] 12:51:37
I mean, we discussed this over the past weeks, and we got the comment:

      [Enrico Fermi Institute] 12:51:42
if it can only run generators, that will probably still get your proposal through,

      [Enrico Fermi Institute] 12:51:47
      But you're not going to get the hours credited

      [Enrico Fermi Institute] 12:51:52
At least not easily. Maybe that's one of the outcomes from this: to push towards the actual WLCG.

      [Enrico Fermi Institute] 12:52:02
Hope you're hearing this: we should get credit.

      [Enrico Fermi Institute] 12:52:08
We should work towards a model where useful computation gets credit, no matter what it is. But the pledging comes before the useful computation: if the resource is limited, that makes it a less useful resource. I kind of see the argument

      [Enrico Fermi Institute] 12:52:28
that when you go in saying: I have this allocation of 100 million hours at the HPC

      [Enrico Fermi Institute] 12:52:34
center, and I can only run generators there. And then there's the Tier 1 site:

      [Enrico Fermi Institute] 12:52:38
      I have 100 million hours equivalent over the whole year, or the allocation period.

      [Enrico Fermi Institute] 12:52:44
It's worth more to the experiment.

      [Enrico Fermi Institute] 12:52:47
      And I see that point

      [Enrico Fermi Institute] 12:52:50
Okay, yeah. But again, where we are right now,

      [Enrico Fermi Institute] 12:52:55
we're saying it's worth zero, right?

      [Enrico Fermi Institute] 12:53:01
Worthwhile. I would hope to get a billion hours of MadGraph

      [Enrico Fermi Institute] 12:53:05
out of Frontier or somewhere, because the hope is that it enables something for the experiment, or offloads an awful lot from the pledge, which is flexible.

      [Enrico Fermi Institute] 12:53:18
So do we need to be pledging at a different quality-of-service level?

      [Enrico Fermi Institute] 12:53:25
I don't know. I think it doesn't have to be for this blueprint, but somebody actually needs to step up and provide a proposal

      [Enrico Fermi Institute] 12:53:37
that people can disagree with. Somebody at some point needs to do some writing to say: here's a model I think is useful, and be willing to take criticism, right?

      [Taylor Childers] 12:53:49
Isn't this avoiding the larger question, which is: how do you make

      [Taylor Childers] 12:54:01
the LHC workflows compatible with modern architectures, right?

      [Taylor Childers] 12:54:07
      I mean, and of course I understand all the hang ups there.

      [Taylor Childers] 12:54:12
I'm just saying that we can talk about what architectures aren't working for the HEP

      [Taylor Childers] 12:54:22
community as long as we want, but we also need to be moving our software in a direction that makes it easier to approach different hardware, because it's just gonna get worse before it gets better. The Europeans are going in their own direction

      [Enrico Fermi Institute] 12:54:41
      Yeah.

      [Taylor Childers] 12:54:45
with hardware, the Japanese are going in their own direction with hardware,

      [Taylor Childers] 12:54:49
the US is probably gonna, I assume, continue with the US manufacturers,

      [Taylor Childers] 12:54:55
okay, for political reasons. And of course the Chinese are developing their own, and plan on having tons of compute power available.

      [Taylor Childers] 12:55:04
So it's really a question of: why can't we move in that direction?

      [Taylor Childers] 12:55:11
And of course I think we all know those answers.

      [Taylor Childers] 12:55:13
But it maybe needs to travel up the chain.

      [Taylor Childers] 12:55:19
      One

      [Enrico Fermi Institute] 12:55:22
Correct. It was...

      [Kaushik De] 12:55:26
Yeah, right. I wanted to make a few comments about this.

      [Kaushik De] 12:55:36
I mean, it's not that it isn't useful for experiments to get access to resources

      [Kaushik De] 12:55:46
that may not be globally useful but provide value for particular workflows.

      [Kaushik De] 12:55:56
I mean, we have the tools to make use of resources like that,

      [Kaushik De] 12:56:00
assuming we are not spending years of development and operational effort to use the resource. I think there's nothing wrong with having specialized resources, as long as they're easy to use,

      [Kaushik De] 12:56:17
I mean, as long as the experiments know how to use them.

      [Kaushik De] 12:56:19
I think the question is: how do you assign a value to that resource?

      [Kaushik De] 12:56:28
I mean, clearly, using the example that was given, comparing, you know, 100 million hours at an HPC

      [Kaushik De] 12:56:38
that only runs generators versus 100 million hours at a Tier 1 that can do everything for the experiment:

      [Kaushik De] 12:56:43
clearly the two things are not the same. So the question is: how do we assign different values to those two different kinds of resources?

      [Kaushik De] 12:56:50
      And I think that is the real challenge for this working group.

      [Kaushik De] 12:56:55
I mean, what we really need to come up with out of this workshop is: how do we assign a fair value to one versus

      [Kaushik De] 12:57:03
the other?
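To make that comparison concrete, here is a toy sketch of one possible weighting. This is purely illustrative: the coverage fractions are invented, and no such formula was agreed in the discussion.

```python
# Illustrative only: discount an allocation by the fraction of the
# experiment's workflow mix the resource can actually run.
# The coverage fractions below are invented for the sake of the example.

def effective_hours(raw_hours, workflow_coverage):
    """Raw allocation scaled by the share of workflows it can serve."""
    return raw_hours * workflow_coverage

hpc_gen_only = effective_hours(100e6, 0.15)   # generator-only HPC allocation
tier1        = effective_hours(100e6, 1.00)   # Tier 1: runs everything

print(f"generator-only HPC: {hpc_gen_only/1e6:.0f}M Tier-1-equivalent hours")
print(f"Tier 1:             {tier1/1e6:.0f}M Tier-1-equivalent hours")
```

Any real accounting model would of course need agreed coverage numbers per resource; the sketch only shows the shape such a discount could take.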

      [Enrico Fermi Institute] 12:57:09
That's a great note to break for lunch on. We have the HPC focus area, where we have more time to go into some of these things in more detail, and we'll have more slides prepared to cover some of that. One thing I just want to mention before we close:

      [Enrico Fermi Institute] 12:57:25
framework developments are specifically supposed to be outside the scope.

      [Enrico Fermi Institute] 12:57:31
I mean, we'll touch on it a little bit, because it sets some of the scope of what's usable and what's not,

      [Enrico Fermi Institute] 12:57:36
but we don't want to go deep into it. Yeah, we have to partition somewhere.

      [Enrico Fermi Institute] 12:57:39
Yeah, please, let's not go designing frameworks here.

      [Enrico Fermi Institute] 12:57:52
We heard from the HPC people already. Okay, so we'll break for an hour.

      [Enrico Fermi Institute] 12:58:04
We'll be back at one o'clock US Central time, and we'll do the HPC

      [Enrico Fermi Institute] 12:58:10
focus area.

      [Enrico Fermi Institute] 12:58:12
See everybody then.

      [Andrew Melo] 13:00:38
Everybody, I had to pop out for a second. Aren't we scheduled to be back at 1 PM?

      [John Steven De Stefano Jr] 13:00:48
Scheduled to reconvene in one hour, at 2 PM

      [John Steven De Stefano Jr] 13:00:53
here in Eastern.

      [Andrew Melo] 13:00:56
      Gotcha. Okay, So we're on schedule.

       

      • 10:00
        Introduction 20m
        Speakers: Dirk Hufnagel (Fermi National Accelerator Lab. (US)), Fernando Harald Barreiro Megino (University of Texas at Arlington), Kenyi Paolo Hurtado Anampa (University of Notre Dame (US)), Lincoln Bryant (University of Chicago (US))
      • 10:20
        High Level Current Landscape and Use - HPC 20m

        - Landscape of workflows:  US HPC



        [Enrico Fermi Institute] 11:12:24
That was the majority of Run 2 Monte Carlo workflows, and the Run 3 Monte Carlo work is similar. For ATLAS it's primarily simulation.

        [Enrico Fermi Institute] 11:12:33
Those are usually specifically assigned to HPC sites. So you select a bunch of tasks, I guess you pick:

        [Enrico Fermi Institute] 11:12:40
this is a good fit, and then you assign it there, and it runs

        [Enrico Fermi Institute] 11:12:42
there. And they also have the goal to expand on that. The limiting factors

        [Enrico Fermi Institute] 11:12:53
on what workflows you target at an HPC are usually based on machine characteristics. So, CPU architecture: Intel is easy; when it gets beyond that, it's currently still a little bit difficult. Do you have a GPU accelerator? How much memory

        [Enrico Fermi Institute] 11:13:10
do you have per core? Low memory per core, with KNL, is kind of a dying breed, disappearing a bit,

        [Enrico Fermi Institute] 11:13:16
so that's usually okay now. Then network connectivity: and it's not just to and from the node; it's also for the facility as a whole.

        [Enrico Fermi Institute] 11:13:27
Sometimes HPC facility restrictions or firewall limits mean that when you scale up, you hit scaling limits where you basically overload the pipe, because they are

        [Enrico Fermi Institute] 11:13:38
not used to such data-intensive workflows.

        [Enrico Fermi Institute] 11:13:41
A quick question back, when we were talking about CPU architecture and floating point operations:

        [Enrico Fermi Institute] 11:13:47
what in particular is making that hard from your perspective? Is it basically ARM or something? Going to ARM is not harder,

        [Enrico Fermi Institute] 11:13:57
        It's just a matter of extra work to validate the platform.

        [Enrico Fermi Institute] 11:14:00
Okay, so it's really about numerical outcomes and making sure that things agree. Yeah, it's basically a one-time investment in being able to support the platform. Is that true on all of them, though? Because that's not true for... Yeah, OLCF is a bit of

        [Enrico Fermi Institute] 11:14:18
an exception; they also have POWER, you know. CMS just finished the POWER validation.

        [Enrico Fermi Institute] 11:14:23
Okay. So the effective requirement, then, is that for a given CPU architecture

        [Enrico Fermi Institute] 11:14:32
the upstream code has to be validated. Well, firstly, you have to build your code,

        [Enrico Fermi Institute] 11:14:38
it's got to be buildable, and then you need to run whatever physics validation: you produce some samples,

        [Enrico Fermi Institute] 11:14:43
and then the physics group, whoever in the global collaboration, needs to go in and say: this is actually okay.

        [Enrico Fermi Institute] 11:14:50
So therefore there's a dependency on something external to you. It

        [Enrico Fermi Institute] 11:15:02
requires labor from outside of the US, because the US

        [Enrico Fermi Institute] 11:15:07
can't just say: this platform is validated. The experiment as a whole has to say that. So, coming back to why you couldn't do pileup during digitization because you had to read it remotely: you can do it.

        [Enrico Fermi Institute] 11:15:24
And we support that. Basically, we

        [Enrico Fermi Institute] 11:15:28
currently don't run anything that needs primary input, but pileup is supported, because pileup is so unevenly distributed, because of its size, that even for normal production on some Tier 2 sites we read it remotely. So that's a use case that is supported anyway, and

        [Enrico Fermi Institute] 11:15:48
the HPCs just extended it. So it's not a limitation.

        [Enrico Fermi Institute] 11:15:53
No. Yeah, I mean, eventually, as you scale up, the network connectivity comes in.

        [Enrico Fermi Institute] 11:15:59
We have to look at that. For instance, at Frontera we're hitting scaling limits because of remote pileup.

        [Enrico Fermi Institute] 11:16:06
I thought at Frontera there was a limit on the amount of remote access you could do.

        [Enrico Fermi Institute] 11:16:13
Yeah, so we actually hit the external connectivity limit of the facility.

        [Enrico Fermi Institute] 11:16:19
And as I recall, at Frontera they mostly consider their Ethernet to be like a control plane.

        [Enrico Fermi Institute] 11:16:26
Each node in the rack is connected at one gig, and each rack is connected

        [Enrico Fermi Institute] 11:16:31
at 10 gig, I think, something like that, to the core.

        [Enrico Fermi Institute] 11:16:36
So in that case you probably weren't doing a lot of pileup at Frontera.

        [Enrico Fermi Institute] 11:16:38
We were reading pileup. Okay, so, I mean, but you are running:

        [Enrico Fermi Institute] 11:16:42
you're accessing your pileup datasets via Ethernet, though.

        [Enrico Fermi Institute] 11:16:48
Yeah. So you're still hitting the overall capacity of the rack,

        [Enrico Fermi Institute] 11:16:54
then, yeah, like 100 gig or something. Well, in the beginning we actually hit the scaling limitations trying to get going.

        [Enrico Fermi Institute] 11:17:03
And then they limited us. But it's fine; I mean, the limit is not restricting.

        [Enrico Fermi Institute] 11:17:10
The limit is still high enough that we don't have a problem using up the allocation overall.

        [Enrico Fermi Institute] 11:17:14
We just couldn't do what we tried to do, which is these 100K-core bursts,

        [Enrico Fermi Institute] 11:17:20
because at that point the traffic was too high. Yeah.

        [Enrico Fermi Institute] 11:17:27
Oh yeah, I was at network connectivity. So we discussed the potential facility limits

        [Enrico Fermi Institute] 11:17:35
here. Then another limitation can be storage. If you use shared storage for input and output data, you would have to integrate it into the data management solution, because you basically have to pre-place the data you want to process, and then stage out the output later. That

        [Enrico Fermi Institute] 11:17:51
decouples the job execution part from your own storage. But also, another

        [Enrico Fermi Institute] 11:18:01
consideration is whether job scratch is local or shared.

        [Enrico Fermi Institute] 11:18:04
For instance, the LCFs usually have only shared storage;

        [Enrico Fermi Institute] 11:18:08
they don't give you any local storage. Most of the user facilities, and Frontera,

        [Enrico Fermi Institute] 11:18:16
give you local scratch, and that is another area where you can run into scaling limitations.

        [Enrico Fermi Institute] 11:18:22
And looking a bit ahead, so this is what we're doing now:

        [Enrico Fermi Institute] 11:18:26
if you look ahead to the HL-LHC era, assuming the resource mix shifts and we get more HPC

        [Enrico Fermi Institute] 11:18:36
resources, can we still afford to restrict the workflows

        [Enrico Fermi Institute] 11:18:42
we run there? Or is that basically restricting ourselves in terms of what we can do operationally?

        [Enrico Fermi Institute] 11:18:53
Right now we do what's easiest, and that just came out of starting this up.

        [Enrico Fermi Institute] 11:18:59
And of course you start with what's easy, to just get something to run.

        [Enrico Fermi Institute] 11:19:03
But as you become experienced with it, and as the amount of resources goes up, that might not be enough to keep scaling up and to take advantage of opportunities.

        [Enrico Fermi Institute] 11:19:15
Now, a question from Shigeki.

        [Shigeki] 11:19:18
Just out of curiosity: this is sort of the state of trying to get work onto the HPC

        [Shigeki] 11:19:26
centers as they exist now. Is there any general motivation on the HPC side to sort of meet us halfway? Do they recognize that maybe this is the future, and they really need to accommodate external workflows

        [Enrico Fermi Institute] 11:19:44
There is, but you have to again distinguish between the user facilities and the LCFs. With the user facilities we've had very good experience, especially with NERSC, working with them.

        [Shigeki] 11:19:46
halfway, in a common sort of way?

        [Enrico Fermi Institute] 11:20:00
At NERSC we started around 2016, CMS had our first allocation there, and we started to target these types of workflows.

        [Enrico Fermi Institute] 11:20:10
For instance, we tested remote data access, and it was kilobytes per second to each node, while the claimed Cori design goal was a gigabit to the node.

        [Enrico Fermi Institute] 11:20:22
So then obviously something in the stack didn't work.

        [Enrico Fermi Institute] 11:20:25
So we worked with them for multiple years, and now we're actually kind of

        [Enrico Fermi Institute] 11:20:29
where we're supposed to be. Everything works great, so they are very interested in working with us.

        [Enrico Fermi Institute] 11:20:36
The LCFs, I don't think we have that relationship.

        [Steven Timm] 11:20:40
        cool.

        [Enrico Fermi Institute] 11:20:42
It would be great if we had it, but we don't.

        [Steven Timm] 11:20:46
So NERSC is also already planning for NERSC-10, which is the machine that comes after Perlmutter. They talk to,

        [Steven Timm] 11:20:54
what do you call it, high-throughput people, and ask: what do we need for the next thing? So they're talking,

        [Steven Timm] 11:20:59
they're gathering numbers, they're talking through what's needed. So those meetings are already happening for the next round.

        [Enrico Fermi Institute] 11:21:04
        Yeah.

        [Steven Timm] 11:21:05
        But the but the others, as you say, are not happening at the moment

        [Enrico Fermi Institute] 11:21:08
Yeah, the feedback we got from NERSC is that they're very interested in supporting data-intensive science, and they took what they learned

        [Enrico Fermi Institute] 11:21:16
on Cori, running these kinds of workloads, into consideration for designing the next machine. Yeah, and in fact, hopefully "data-intensive science" includes data-intensive in the sense of pulling stuff from the WAN, because that's a

        [Enrico Fermi Institute] 11:21:34
different issue, right? It can be streaming things, you mentioned that. Yes, as they scale up, you know, we want to put more workflows on,

        [Enrico Fermi Institute] 11:21:45
and we have to be cognizant of the intrinsic design limitations of the clusters. Running data-intensive science on a facility means you either stream everything in and stream it out, or you need local storage to cache what you process

        [Enrico Fermi Institute] 11:22:01
later. These are the two options here, and that's what I mentioned about storage:

        [Enrico Fermi Institute] 11:22:10
it depends what each facility gives you. If you don't have a lot of attached storage, and you can get only a small storage quota compared to your CPU quota, then you don't have a lot of options in terms of how to make use

        [Enrico Fermi Institute] 11:22:23
of that CPU quota. If you do get a lot of storage, you can run it like we run regular production on a grid site:

        [Enrico Fermi Institute] 11:22:34
we pre-stage within our data management systems, we run, and we stage things back out. That makes things simple.
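As a rough sketch of what that pre-stage step can look like, assuming Rucio is the data management system and a storage element fronts the HPC (the scope, dataset name, and RSE name below are placeholders, not real endpoints):

```python
# Minimal sketch: pre-place a dataset at an HPC-attached storage endpoint
# before the jobs run. Assumes an already-configured Rucio client; the
# scope, dataset name, and RSE name below are hypothetical.
from rucio.client import Client

client = Client()
client.add_replication_rule(
    dids=[{"scope": "mc_campaign", "name": "input.dataset.name"}],
    copies=1,
    rse_expression="HPC_SCRATCH",   # storage element fronting the HPC
    lifetime=14 * 24 * 3600,        # let the rule expire after two weeks
)
# After processing, a second rule pointing back at grid storage
# stages the outputs out again.
```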

        [Ian Fisk] 11:22:40
        oh!

        [Enrico Fermi Institute] 11:22:41
Do you have an idea of what the scale there would be to make

        [Enrico Fermi Institute] 11:22:45
these facilities more usable? I mean, the ballpark figure we usually quote: a CMS site with a sizable amount of CPU would like to have something like 500 TB of space. Roughly, I'd say, some hundreds of terabytes. Yeah, we could use probably

        [Enrico Fermi Institute] 11:23:02
300 or 400, but around that point. If it's less than 100, it gets difficult.

        [Enrico Fermi Institute] 11:23:06
Yeah. And that's usually where we are with the experience from ALCC grants, for instance; usually 150 is kind of the cutoff. That's not a lot.

        [Enrico Fermi Institute] 11:23:23
Of course it would be nice if we could ask for a large storage allocation and just, you know, use it as a storage element,

        [Enrico Fermi Institute] 11:23:29
treat it like another site. But then that also requires storage allocations over long periods of time, rather than a yearly kind of allocation.

        [Enrico Fermi Institute] 11:23:43
        Yeah.

        [Ian Fisk] 11:23:43
Yeah, I'm wondering if the concept of streaming in versus local storage is a distinction without a lot of difference.

        [Ian Fisk] 11:23:52
It's more about the time scale, right? Say they have 100 TB of data:

        [Enrico Fermi Institute] 11:23:54
        Yeah.

        [Ian Fisk] 11:23:56
you're either streaming it directly in real time, or you're staging it in and staging it out, because 100 TB of data is not a ton of space at large scale.

        [Enrico Fermi Institute] 11:24:01
Yeah, there's a small technical difference, because in one case you just keep the data in job scratch,

        [Enrico Fermi Institute] 11:24:12
and in the other case you have to place it somewhere that's independent of job execution. And that can make a technical difference because, for instance, NERSC doesn't count job scratch against your scratch quota,

        [Ian Fisk] 11:24:27
        Okay.

        [Enrico Fermi Institute] 11:24:29
while if you put something in via the DTNs, the data transfer nodes, that does count against it. And I think a lot of it's also cultural, right, in terms of

        [Enrico Fermi Institute] 11:24:42
not commonly seeing workflows that stream data. What most people expect, and what they cater to, is data coming in through DTNs to the file system,

        [Enrico Fermi Institute] 11:25:00
and some time later a process consumes it. So...

        [Ian Fisk] 11:25:04
But somehow there's a balance here between the networking, the local storage, and the I/O of the jobs: you need to have a sufficient amount of

        [Ian Fisk] 11:25:12
I/O to keep the resources busy. And so it's not much more complicated than that.

        [Ian Fisk] 11:25:20
And it has to be a convergent system, in the sense that you're not going to be able to keep the storage forever.

        [Enrico Fermi Institute] 11:25:27
Okay, yeah. Of course you're right that it's a storage management problem more than

        [Enrico Fermi Institute] 11:25:35
it is a storage problem.

        [Ian Fisk] 11:25:39
I'm claiming it's a data delivery problem, whether it's being streamed in or whether it's being cached from a stream.

        [Ian Fisk] 11:25:45
Effectively both of them are the same problem, which is: how do I get data in?

        [Ian Fisk] 11:25:52
If something's streaming in, it's sort of a real-time problem, and it's a little bit simpler in the sense that it's a network problem; I know the I/O when there's

        [Ian Fisk] 11:26:03
no long delay. But if I expand it out to the time scale of even just a couple of weeks, staging it in still requires a certain amount of networking, and staging it out too.

        [Ian Fisk] 11:26:13
How much time do I have on this particular resource?
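That time-scale point can be put into rough numbers. This is simple arithmetic on the 100 TB figure from the discussion; the one-day burst window is an assumed comparison point, not something quoted.

```python
# Back-of-the-envelope: average bandwidth needed to deliver 100 TB,
# either trickled in over a two-week allocation window or delivered
# to a burst of cores that materializes all at once.
TB_BITS = 8e12  # one terabyte expressed in bits

def avg_gbps(terabytes, seconds):
    """Average rate in Gb/s to move the given volume in the given time."""
    return terabytes * TB_BITS / seconds / 1e9

two_weeks = 14 * 24 * 3600
one_day   = 24 * 3600

print(f"100 TB over two weeks: {avg_gbps(100, two_weeks):.2f} Gb/s")
print(f"100 TB over one day:   {avg_gbps(100, one_day):.1f} Gb/s")
```

Spread over two weeks the average rate is under 1 Gb/s; compressed into a one-day burst it is nearly 10 Gb/s, which is where facility connectivity limits start to matter.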

        [Enrico Fermi Institute] 11:26:17
So, Dirk, doesn't this depend on the scheduling modality of the HPC?

        [Enrico Fermi Institute] 11:26:22
Because they tend to...

        [Enrico Fermi Institute] 11:26:26
you know, you tend to get put into a queue,

        [Enrico Fermi Institute] 11:26:29
you're waiting for a while, and then suddenly you have nodes to use.

        [Enrico Fermi Institute] 11:26:37
It's simpler if you stream: you remove the data management part from the equation, because you assume you can just pull the data when you need it.

        [Enrico Fermi Institute] 11:26:48
But you can't do that if you're being scheduled like that, where you're suddenly getting 50,000 cores. You've been waiting for two weeks; then on Monday morning they give you 50,000 cores,

        [Enrico Fermi Institute] 11:26:57
and you've got no data there, right? Well, if you assume that these 50,000 cores can access the data via streaming, then you can feed them:

        [Steven Timm] 11:27:04
        Great

        [Enrico Fermi Institute] 11:27:06
yeah, you pull it from somewhere else, and you don't need to schedule the data.

        [Enrico Fermi Institute] 11:27:09
So data delivery is on demand. Eventually you hit scaling limits,

        [Steven Timm] 11:27:09
        Right.

        [Steven Timm] 11:27:12
        Good.

        [Enrico Fermi Institute] 11:27:18
but that's more a question of where the network comes in, and how our own sites are dimensioned.

        [Enrico Fermi Institute] 11:27:23
This is still the introduction; we have the HPC focus area,

        [Enrico Fermi Institute] 11:27:27
and we also have a couple of related sessions, so I don't want to go too deep into it.

        [Steven Timm] 11:27:28
        Okay.

        [Enrico Fermi Institute] 11:27:30
But I think the point is, from an architectural point of view,

        [Enrico Fermi Institute] 11:27:33
having the data that you need on site for your cores could help enormously, because the site is presumably sized for it.

        [Steven Timm] 11:27:37
        Alright.

        [Enrico Fermi Institute] 11:27:43
You hope that the site is sized appropriately for the cores, which may or may not be true in all cases, and there's also reliability. The last thing

        [Steven Timm] 11:27:48
        Great

        [Steven Timm] 11:27:52
        Okay.

        [Enrico Fermi Institute] 11:27:53
you want to do is wait two weeks, get your 50,000 cores, and then be starved for data.

        [Enrico Fermi Institute] 11:27:58
So we got a couple of questions.

        [Ian Fisk] 11:27:59
        But

        [Enrico Fermi Institute] 11:28:04
Shigeki again?

        [Steven Timm] 11:28:05
Yeah. So you have to consider not only the size of the file system,

        [Shigeki] 11:28:08
        Hello!

        [Steven Timm] 11:28:12
sorry, but also the reliability of the file system, and also the IOPS of reading it, because we managed to scramble the Lustre file system pretty badly several times.

        [Steven Timm] 11:28:24
Anyway, I'm not sure it's Lustre.

        [Steven Timm] 11:28:26
But anyway, we scrambled their scratch very badly a

        [Steven Timm] 11:28:29
couple of times, and Perlmutter is having issues too. It's not our fault.

        [Steven Timm] 11:28:33
But their scratch file systems are not always built to take CMS-level

        [Steven Timm] 11:28:38
I/O. Yep, we have to be prepared for this.

        [Enrico Fermi Institute] 11:28:43
Our I/O, especially if you look at generator-type workflows, is not great:

        [Steven Timm] 11:28:43
        Something won't be

        [Enrico Fermi Institute] 11:28:49
they're basically built for desktop use, and we scale them up to grid scale.

        [Enrico Fermi Institute] 11:28:54
We have a question coming in.

        [Shigeki] 11:28:56
Yeah, I guess my fundamental question is: all of these issues are best addressed at the design phase of the HPC center.

        [Shigeki] 11:29:06
And I'm kind of wondering: does the community have an official avenue in which to present our issues and work with them at the design stage of the HPC center, where we can both agree on the mechanism for moving the data in and out?

        [Enrico Fermi Institute] 11:29:27
Not really, not at the moment. I think the user facilities are at least aware of what we're doing,

        [Enrico Fermi Institute] 11:29:34
the type of work we're doing, because they see it more often.

        [Enrico Fermi Institute] 11:29:37
The LCFs, I don't think so, not at this level, because they are really

        [Enrico Fermi Institute] 11:29:45
targeting these things: give me a thousand nodes for my lattice QCD

        [Enrico Fermi Institute] 11:29:50
calculation, or protein folding, or whatever they're doing.

        [Enrico Fermi Institute] 11:29:52
That's their target market, basically.

        [Shigeki] 11:29:55
But probably that's because that's the target market that they see,

        [Shigeki] 11:30:00
        And it's sort of a chicken and egg problem.

        [Shigeki] 11:30:01
They're not going to see the high-throughput issues because it's so hard to do, and they're not going to do anything about it because they just don't see it. It's really a chicken-and-egg problem.

        [Enrico Fermi Institute] 11:30:11
But they're following their Congressional mandate.

        [Enrico Fermi Institute] 11:30:13
So why would they go against the Congressional mandate? I think this is also a discussion

        [Enrico Fermi Institute] 11:30:18
that's too high-level for us to have any input on.

        [Enrico Fermi Institute] 11:30:25
I know they have discussions going on at a very high level about supporting these types of science better.

        [Enrico Fermi Institute] 11:30:33
But, as Brian said, until there's actually a mandate for them that says they're supposed to support us better,

        [Enrico Fermi Institute] 11:30:41
I don't think they're going to move a lot in terms of making their facilities work better for the computation that we're doing. What I mean

        [Enrico Fermi Institute] 11:30:53
is that the APS works with ALCF on taking data from their light source and streaming it.

        [Enrico Fermi Institute] 11:31:04
I believe NERSC is in conversations with a couple of the West Coast light sources. And I remember one talk I was at where

        [Enrico Fermi Institute] 11:31:13
I think OLCF was talking about doing that also, from the neutron source and some of the accelerators on campus.

        [Taylor Childers] 11:31:21
Can I? Right, yeah. Sorry, so I was just

        [Enrico Fermi Institute] 11:31:22
So we have a comment from Taylor. Go ahead.

        [Taylor Childers] 11:31:28
going to... And I mean, Doug brought up another good point. But just to comment on a few of the things:

        [Taylor Childers] 11:31:36
I'll go to the APS first. So our new Polaris machine actually has 60-some-odd nodes dedicated, that we purchased in addition, for the APS, for real-time processing. The idea is that the workflows there have live detectors that are

        [Taylor Childers] 11:31:59
taking data, and we want to see if we can get those scientists on our machines. When it comes to the design process for the new machines,

        [Taylor Childers] 11:32:10
right, for instance, with Aurora we had the Aurora Early Science Program,

        [Taylor Childers] 11:32:16
OLCF had a similar program, same for Perlmutter.

        [Taylor Childers] 11:32:20
Those are entirely designed around how communities get on, you know,

        [Taylor Childers] 11:32:27
get early access to our machines. ATLAS submitted one of those projects, and has had myself and, in fact, a postdoc funded through ALCF to help, mostly on event generators,

        [Enrico Fermi Institute] 11:32:28
        Yeah.

        [Taylor Childers] 11:32:46
at this point used on Aurora moving forward. So there is a program for being involved in the early design process for the machine.

        [Taylor Childers] 11:33:02
So, for instance, in the ATLAS case, MadGraph is constantly reported on in the Intel meetings for Aurora,

        [Taylor Childers] 11:33:11
as far as performance and capability, because, you know, we're one of the Early Science projects.

        [Taylor Childers] 11:33:23
But the other end of the spectrum, I would say,

        [Taylor Childers] 11:33:27
is this: of course, if you're a big user, right,

        [Taylor Childers] 11:33:30
and I think HEP has always had the potential to be a big user at the LCFs,

        [Enrico Fermi Institute] 11:33:31
        Okay.

        [Taylor Childers] 11:33:39
granted there are hurdles, especially now with architectures, but if you're a big user, you have big sway, right?

        [Taylor Childers] 11:33:49
I mean, the lattice QCD groups, they can use our entire machines.

        [Taylor Childers] 11:33:53
They use them effectively, and of course we pander to them, I would say, unofficially I guess. But they get huge sway at our meetings because they are able to effectively use our resources. And same for, I mean, everybody knows the HACC group, Salman's group,

        [Taylor Childers] 11:34:12
and the climate scientists, right, the materials scientists: the communities whose software is easy to port to the next-generation

        [Taylor Childers] 11:34:23
hardware. They move quickly, the communities move quickly, and they all use similar software.

        [Taylor Childers] 11:34:28
They get a lot of pull in those discussions. Now, the last thing I wanted to mention: the difference between NERSC and the LCFs, I would say, is that the LCFs

        [Taylor Childers] 11:34:42
have less

        [Taylor Childers] 11:34:48
funding for deploying a lot of user-centric hardware.

        [Taylor Childers] 11:34:54
So we've been talking to ALCF, I don't know how long, about trying to stand up a, you know,

        [Taylor Childers] 11:35:01
side cluster for Kubernetes and stuff like that, where you guys could run all of these services. And as far as I can tell, our operations team is just swamped with stuff to do, and so that becomes a limiting factor for us.

        [Enrico Fermi Institute] 11:35:21
Thanks, Taylor. I think that was kind of the direction of my comment:

        [Enrico Fermi Institute] 11:35:26
we have to be aware, you know, that the LCFs build machines to be HPC machines, so you want to make yourself look like the QCD folks and do HPC

        [Enrico Fermi Institute] 11:35:39
work. It becomes a huge ask for them to try to support HTC-type workflows, because of the exact sort of pressures you just outlined.

        [Taylor Childers] 11:35:51
        Yeah.

        [Enrico Fermi Institute] 11:35:52
So we have a couple more questions on Zoom. Let's take those and then move on to the cloud section. Paolo?

        [Paolo Calafiura (he)] 11:36:02
Hi guys. So it's actually a comment following up on this.

        [Paolo Calafiura (he)] 11:36:07
And I find it useful sometimes to put myself in the shoes of the other partner when we have a discussion. I mean, think of it from the point of view of an LCF. Today, basically, HEP

        [Paolo Calafiura (he)] 11:36:25
is using HPCs at arm's length, let's be honest. I mean, we have some nice Tier-2-like facility

        [Paolo Calafiura (he)] 11:36:31
at NERSC; we are pretty happy with the way NERSC is working. But, you know, QCD,

        [Paolo Calafiura (he)] 11:36:42
we're talking about groups where, if the LCFs

        [Paolo Calafiura (he)] 11:36:44
did not exist today, they would not be able to do their science.

        [Paolo Calafiura (he)] 11:36:46
And so that is something that any one of them will consider: am I fundamental, or am I just one of the 25

        [Paolo Calafiura (he)] 11:36:55
or 32 in the federation?

        [Paolo Calafiura (he)] 11:37:00
So I think, at least for the next generation of HPCs, not Aurora, but the one after, the ones which will start in the 2030s or so, maybe we have a shot. But we would need to make a

        [Paolo Calafiura (he)] 11:37:21
commitment, which I don't know if we are ready to make today, which is to say that, at least in the US,

        [Paolo Calafiura (he)] 11:37:29
the HPCs would become a fundamental part, and not just a beyond-pledge accessory, of our computing. Yeah, and that's also because of the enormous amount of effort we would have to put in, as has been said a couple

        [Paolo Calafiura (he)] 11:37:47
of times, to be able to exploit these architectures.

        [Paolo Calafiura (he)] 11:37:51
So I think either we jump, or we stay where we are with our friendly facilities.

        [Enrico Fermi Institute] 11:37:58
        Okay.

        [Enrico Fermi Institute] 11:38:06
Ian, comments?

        [Ian Fisk] 11:38:07
Yeah, my comment was sort of along the lines of, I also responded to Shigeki. I think one of the things we need to be a little bit careful of is what our expectations are, and the biggest one is that these facilities were not built for us. And we know

        [Ian Fisk] 11:38:23
that, but that doesn't mean that they can't be useful to us.

        [Ian Fisk] 11:38:27
At the same time, we can't expect to use all of them. I mean, Frontier is 10 times the size of the WLCG

        [Ian Fisk] 11:38:36
combined, in terms of flops, so we wouldn't even want to use the whole thing.

        [Ian Fisk] 11:38:42
But from the standpoint of the stability of the file systems, what Steve was saying, the scale of the file system:

        [Ian Fisk] 11:38:49
I think all these things are things that we actually can measure and benchmark, and look at how much of an LCF

        [Ian Fisk] 11:38:55
we might reasonably be able to take advantage of with a workflow it was not designed for,

        [Ian Fisk] 11:39:00
instead of having an expectation that they will be somehow different, that they will design these facilities for us.

        [Ian Fisk] 11:39:04
They won't; they build for who they already serve. And the question is: is a Ferrari still useful to us at some scale? The only real way to answer that is to measure it, to have a benchmark

        [Ian Fisk] 11:39:16
which we can use that says: this is how many resources you can expect to take advantage of before you exceed the local file system, or the local network, or the local whatever else. And it seems like this is a tractable problem, and these resources exist.

        [Ian Fisk] 11:39:33
Over the course of time, if we demonstrate that we use them at all, maybe we'll have an influence on the next generation, to make them useful for us too. But I think we're not going to be in a situation where basically all of our stuff looks like AI,

        [Ian Fisk] 11:39:49
so that it's a simple transition over to HPC.

        [Ian Fisk] 11:39:53
Our stuff looks like our stuff.

        [Ian Fisk] 11:39:56
It's not going to look like lattice, it's not going to look like AI, necessarily, completely.

        [Ian Fisk] 11:40:00
But I think we know what our workflows look like.

      • 10:40
        High Level Current Landscape and Use - Cloud 20m

        - Landscape of workflows: Cloud

        [Enrico Fermi Institute] 11:41:08
Let's move on to cloud for the moment, I think, and Fernando will go through these slides.

        [Fernando Harald Barreiro Megino] 11:41:14
Hi. Yeah, so now it's a similar discussion, but for cloud: what are the workflows that can be executed on

        [Fernando Harald Barreiro Megino] 11:41:26
cloud resources? Before getting there: what we have been mostly considering during our previous discussions for the blueprint process are the major commercial cloud providers, like Google, Amazon, Microsoft, which are the ones that we have been really testing in the last couple of years.

        [Fernando Harald Barreiro Megino] 11:41:45
All of these have different service levels. They provide infrastructure as a service, where you rent the machine and install whatever you want; platform as a service, at a higher level; and software as a service. But nowadays all of these clouds also have emerging intermediate levels, in particular

        [Fernando Harald Barreiro Megino] 11:42:05
containers as a service, for example Kubernetes, or other serverless flavors of container execution.

        [Fernando Harald Barreiro Megino] 11:42:16
On these services we build cloud-native approaches to integrate our experiment

        [Fernando Harald Barreiro Megino] 11:42:24
        frameworks across the cloud providers, so that all of them look the same.

        [Fernando Harald Barreiro Megino] 11:42:30
Yeah. And then the other cloud provider that has been tested lately is Lanceium.

        [Fernando Harald Barreiro Megino] 11:42:47
They differentiate themselves through, in particular, sustainability and the usage of renewable energy. They are also much more affordable than Google. But they are also not a full-blown cloud, they just have limited services, and

        [Fernando Harald Barreiro Megino] 11:43:09
reliability probably depends on how much renewable energy is available at the moment.

        [Fernando Harald Barreiro Megino] 11:43:17
And so CMS is trying them out, and I've integrated them once

        [Enrico Fermi Institute] 11:43:18
        Okay.

        [Fernando Harald Barreiro Megino] 11:43:20
already, for some simple tests. So, next slide.

        [Fernando Harald Barreiro Megino] 11:43:29
So for ATLAS, coming to the question: what are the workflows that are possible to execute on the cloud?

        [Fernando Harald Barreiro Megino] 11:43:39
Lately we are integrating clouds as completely independent, self-managed sites, with a storage element and also compute, integrated in PanDA. What we have the most experience

        [Fernando Harald Barreiro Megino] 11:44:01
with is Google, and we started in the middle of this year to run

        [Fernando Harald Barreiro Megino] 11:44:05
a cluster similar in size to a US Tier 2, and we are running now at

        [Fernando Harald Barreiro Megino] 11:44:11
10,000 cores, currently limited to production workloads.

        [Fernando Harald Barreiro Megino] 11:44:17
But that's just because we are reorganizing the storage behind it, and we plan to enable analysis in a couple of weeks.

        [Fernando Harald Barreiro Megino] 11:44:29
The one thing that maybe you want to control is the amount of egress, to bring down the cost.

        [Fernando Harald Barreiro Megino] 11:44:39
If you want to do that, the obvious choice is to run simulation.

        [Fernando Harald Barreiro Megino] 11:44:44
But we are also now starting to experiment with full chain, where you run all of the tasks,

        [Fernando Harald Barreiro Megino] 11:44:55
the simulation, the reconstruction, the whole production chain, within the cloud, and we don't export the intermediate products,

        [Fernando Harald Barreiro Megino] 11:45:01
just the final outputs. In the plot I wanted to show that, depending on the workload you are running, your egress costs vary a lot, and that's the motivation for trying to keep intermediates inside.
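As a toy illustration of how strongly the export pattern drives egress cost: the per-GB rate and per-event sizes below are invented for the example, not the actual contract terms (which, in the subscription discussed later, are flat-rate anyway).

```python
# Toy egress estimate: exporting every intermediate step versus only a
# compact final format. Per-GB price and per-event sizes are invented
# for illustration; real cloud list prices and event sizes differ.
EGRESS_USD_PER_GB = 0.08          # assumed list-style rate
events = 100e6                    # a hypothetical 100M-event campaign

sizes_kb = {
    "all intermediates exported": 2000,  # assumed kB/event, all steps
    "AOD only":                    400,  # assumed kB/event
    "compact analysis format":      50,  # assumed kB/event
}

for scenario, kb_per_event in sizes_kb.items():
    gb = events * kb_per_event / 1e6     # kB -> GB
    print(f"{scenario:28s} {gb/1e3:8.1f} TB egress -> "
          f"${gb * EGRESS_USD_PER_GB:,.0f}")
```

The point is just the ratio: keeping intermediates inside the cloud shrinks the exported volume, and with it the metered egress, by more than an order of magnitude under these assumptions.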

        [Fernando Harald Barreiro Megino] 11:45:24
Then the other thing that we have been experimenting with in the cloud is analysis facility type setups

        [Fernando Harald Barreiro Megino] 11:45:31
with elastic scaling. So we set up an analysis facility with Jupyter and Dask; we keep the always-on components running in the cloud to a minimum, and only scale out a lot of VMs when they are requested by a user to

        [Fernando Harald Barreiro Megino] 11:45:49
run a computation. And this is also a very suitable setup for the cloud, because you just pay for the resources that you are using at the moment.
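A minimal sketch of that elastic pattern, assuming a Jupyter-plus-Dask analysis facility on a cloud Kubernetes service with the dask-kubernetes operator installed; the cluster name, image, and worker limits are placeholders:

```python
# Minimal sketch: an adaptive Dask cluster on a cloud Kubernetes service.
# Assumes the dask-kubernetes operator is deployed; names are placeholders.
from dask_kubernetes.operator import KubeCluster
from dask.distributed import Client

cluster = KubeCluster(name="analysis-burst",
                      image="ghcr.io/dask/dask:latest")
# Keep zero workers (and near-zero cost) when idle; burst when work arrives.
cluster.adapt(minimum=0, maximum=200)

client = Client(cluster)
futures = client.map(lambda x: x ** 2, range(1000))  # stand-in for real tasks
print(sum(client.gather(futures)))
```

The adaptive range is what maps onto the pay-as-you-go point above: workers, and the VMs behind them, only exist while user tasks are queued.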

        [Fernando Harald Barreiro Megino] 11:46:03
Then, the next slide:

        [Fernando Harald Barreiro Megino] 11:46:07
so this is the landscape of clouds for CMS.

        [Fernando Harald Barreiro Megino] 11:46:12
I don't know if Kenyi wants to talk about it, or if I should

        [Fernando Harald Barreiro Megino] 11:46:16
go through it.

        [Kenyi Paolo Hurtado Anampa] 11:46:18
Yes. So, in essence, back in 2016 there was a first big demonstration:

        [Kenyi Paolo Hurtado Anampa] 11:46:26
work was done to try different cloud providers for running production workloads, and what was done with the HEPCloud team set record workloads. It basically shows that we can run any kind of production workflow in the cloud, and you can see the

        [Enrico Fermi Institute] 11:46:34
        Okay.

        [Kenyi Paolo Hurtado Anampa] 11:46:52
diagram there on the right. This was when the Fermilab facility was expanded

        [Kenyi Paolo Hurtado Anampa] 11:47:00
in order to get twice the number of resources that were initially available from the global pool.

        [Kenyi Paolo Hurtado Anampa] 11:47:06
So this is showing like 150,000

        [Kenyi Paolo Hurtado Anampa] 11:47:11
cores there, on top of the base resources. That was integrated via glideins as part of it, and as of today we can still use it. There is some work going

        [Enrico Fermi Institute] 11:47:34
        Yeah.

        [Kenyi Paolo Hurtado Anampa] 11:47:39
on to use this for, for example, specialized analysis workloads that depend on machine learning inference.

        [Kenyi Paolo Hurtado Anampa] 11:47:48
So there is some work to

        [Enrico Fermi Institute] 11:47:59
        Okay.

        [Kenyi Paolo Hurtado Anampa] 11:48:01
utilize GPUs and to use them from different cloud providers.

        [Kenyi Paolo Hurtado Anampa] 11:48:10
There is an inference server called Triton,

        [Kenyi Paolo Hurtado Anampa] 11:48:18
and that was also integrated as part of SONIC. And with that

        [Kenyi Paolo Hurtado Anampa] 11:48:25
you can run the analysis pipeline, pushing the machine learning inference through Triton to the

        [Kenyi Paolo Hurtado Anampa] 11:48:37
cloud providers there, on GPUs or using CPUs.
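For reference, a minimal sketch of how a job can offload inference to a remote Triton server over HTTP, which is the pattern SONIC wraps inside the experiment frameworks; the server address, model name, and tensor names here are hypothetical:

```python
# Minimal sketch: send one inference request to a remote Triton server.
# Server address, model name, and tensor names are hypothetical.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="triton.example.org:8000")

batch = np.random.rand(1, 16).astype(np.float32)   # stand-in for real features
inp = httpclient.InferInput("INPUT0", list(batch.shape), "FP32")
inp.set_data_from_numpy(batch)
out = httpclient.InferRequestedOutput("OUTPUT0")

result = client.infer(model_name="demo_model", inputs=[inp], outputs=[out])
print(result.as_numpy("OUTPUT0"))
```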

        [Enrico Fermi Institute] 11:48:41
I can put some numbers in. I think they ran on 10,000 CPU cores:

        [Enrico Fermi Institute] 11:48:50
10,000 CPU cores, and they rented 100 GPUs, and sped up the workflow that was running on the CPUs by 10%. So in that game you basically invest a little bit in GPUs just to speed up the calculation that runs on

        [Enrico Fermi Institute] 11:49:05
the CPUs. So the ratio was 10,000 CPUs to how many GPUs? 100. I mean, it's early work, so hopefully that ratio can be reduced, but that was what they were testing.
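Just the arithmetic of those quoted numbers:

```python
# The arithmetic of the quoted test: 100 GPUs speeding up a
# 10,000-CPU-core workflow by 10%.
cpu_cores, gpus, speedup = 10_000, 100, 0.10

cpu_core_equivalents = cpu_cores * speedup   # work the GPUs effectively added
per_gpu = cpu_core_equivalents / gpus

print(f"{cpu_core_equivalents:.0f} CPU-core equivalents gained, "
      f"about {per_gpu:.0f} per GPU")
```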

        [Enrico Fermi Institute] 11:49:20
Okay. Questions or comments on the landscape of cloud?

        [Enrico Fermi Institute] 11:49:30
Anything more to bring up? Otherwise we can move on to acquisition and operation.


        [Fernando Harald Barreiro Megino] 11:49:35
        okay.

        [Ian Fisk] 11:49:36
Sorry, I have a comment, and I thought I was talking. Sorry.

        [Ian Fisk] 11:49:42
This is Ian. So the general comment was: we have this issue about the egress charges, which we don't ever seem to have a solution for, except not exporting data.

        [Enrico Fermi Institute] 11:49:43
        Okay.

        [Enrico Fermi Institute] 11:49:43
        Okay, Got it.

        [Steven Timm] 11:49:56
        no, not so. There are agreements.

        [Ian Fisk] 11:50:03
But the agreements are always things like: if egress is under 15% of the billing charges, they waive it.

        [Ian Fisk] 11:50:09
There are ways to reduce it.

        [Ian Fisk] 11:50:11
But fundamentally, this is a business practice that they use for vendor lock-in. And so

        [Ian Fisk] 11:50:19
far, at least, no one has been proposing not to do it.

        [Ian Fisk] 11:50:21
        And so we're always okay.

        [Enrico Fermi Institute] 11:50:23
Two things: Lancium does not have egress charges.

        [Ian Fisk] 11:50:26
        Okay.

        [Enrico Fermi Institute] 11:50:27
With the caveat that we're still exploring it, and it's very early going.

        [Steven Timm] 11:50:28
        Pretty good.

        [Enrico Fermi Institute] 11:50:32
But by design, at least what they're saying now, they don't charge egress.

        [Ian Fisk] 11:50:37
        Right.

        [Enrico Fermi Institute] 11:50:38
And then, Fernando, do you want to say something about this subscription?

        [Enrico Fermi Institute] 11:50:41
What that model is, because I—

        [Fernando Harald Barreiro Megino] 11:50:43
I could discuss that tomorrow during the cloud session.

        [Ian Fisk] 11:50:47
        Okay.

        [Fernando Harald Barreiro Megino] 11:50:48
But I mean, basically, the agreement we have with Google is a subscription agreement.

        [Fernando Harald Barreiro Megino] 11:50:57
And that's basically like a flat rate.

        [Fernando Harald Barreiro Megino] 11:51:00
You agree on a price and on the amount of resources that are included,

        [Fernando Harald Barreiro Megino] 11:51:03
and egress will not be touched; there is no meter on how much egress you do.

        [Fernando Harald Barreiro Megino] 11:51:08
It's a fixed price for your 15 months.
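A minimal sketch of the difference between the two pricing models being contrasted here, with made-up numbers (the real agreement terms are not stated in this discussion):

    # Flat-rate subscription vs. metered on-demand pricing with egress.
    # Every rate below is an illustrative assumption.
    MONTHS = 15
    subscription_per_month = 100_000.0       # assumed flat rate, egress included

    core_hours_per_month = 7_300_000         # ~10k cores running continuously
    on_demand_core_hour = 0.02               # assumed $/core-hour
    egress_tb_per_month = 500
    egress_per_tb = 80.0                     # assumed $/TB list price

    flat = MONTHS * subscription_per_month
    metered = MONTHS * (core_hours_per_month * on_demand_core_hour
                        + egress_tb_per_month * egress_per_tb)
    print(f"subscription: ${flat:,.0f}; metered: ${metered:,.0f}")

The point of the flat rate is that egress stops being a separate line item; whether it is cheaper overall depends entirely on the negotiated numbers.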

        [Ian Fisk] 11:51:14
Yeah, okay. So I guess the question is: at the end of your 15 months, if you want to use the last month only to export your data and get out of the cloud, that would be within the confines of the model. Is that a true statement?

[Fernando Harald Barreiro Megino] 11:51:32
Yes. As you are running jobs, the output is always exported, so we are always incurring the egress cost.

        [Ian Fisk] 11:51:40
Okay, alright. I guess my point is that this is a fundamental problem: we can essentially only use the cloud a lot like HPC, except that with HPC we don't pay to export the data.

        [Enrico Fermi Institute] 11:51:42
        Yeah.

        [Steven Timm] 11:51:51
        Yeah.

        [Enrico Fermi Institute] 11:52:01
Yeah. I mean, my opinion on the cloud is that workflow selection and capabilities are not the issue, because we can do anything we want on the cloud; it's just a machine you rent. The question comes down to: what's the cost?

        [Steven Timm] 11:52:22
        Great Great. Well, this one, you

        [Enrico Fermi Institute] 11:52:23
And how do they structure the pricing? What do they want to allow you to do, and in what way?

        [Enrico Fermi Institute] 11:52:29
What are the restrictions?

        [Ian Fisk] 11:52:30
And the other point I wanted to make was that one of the fundamental differences between HPC

        [Ian Fisk] 11:52:37
and cloud is that HPC, at the leadership class, relies almost exclusively on accelerated, GPU-style

        [Ian Fisk] 11:52:44
hardware. And it's not that the cloud providers

        [Ian Fisk] 11:52:48
don't have them, but that's the most expensive element on the cloud, and it's because they depreciate so fast that the cloud providers need to recoup that cost in a shorter period of time than

        [Ian Fisk] 11:52:59
they do for CPU. So you find that the economics of the GPU and the CPU are different on the cloud.

        [Enrico Fermi Institute] 11:53:09
It's also structural. I'll leave that comment there, because we do have the cloud focus area tomorrow.

        [Ian Fisk] 11:53:15
        Okay, right

        [Enrico Fermi Institute] 11:53:15
We should not try to have all the discussions now. Let's have the comment from

        [Enrico Fermi Institute] 11:53:20
Johannes.

        [Johannes Elmsheuser] 11:53:22
Yeah, just to follow up on the egress, right?

        [Johannes Elmsheuser] 11:53:26
So if you go one slide back, to slide 11, Fernando has a little bit of a breakdown there

        [Johannes Elmsheuser] 11:53:34
of the different costs. And there's always, I think, some fear that egress is really humongous compared to everything else. But from what we are seeing, running, for example, on Google

        [Johannes Elmsheuser] 11:53:47
and doing physics validation there, the egress is not the overall cost driver, unless you do really crazy stuff, right?

        [Johannes Elmsheuser] 11:53:57
So when you have a regular simulation task, egress is not dominant; it's really the CPU

        [Johannes Elmsheuser] 11:54:03
        that you are scaling up, that is driving the cost.

        [Johannes Elmsheuser] 11:54:06
Egress is obviously something on top that

        [Johannes Elmsheuser] 11:54:13
you have to pay, compared to HPC; there's no discussion there.

        [Johannes Elmsheuser] 11:54:17
But it's also not humongous when you compare everything and fold everything in. I just wanted to make that statement, and I think we can discuss this in more detail later in the dedicated cloud session.

        [Ian Fisk] 11:54:29
I would claim that it's not humongous as long as you're in a very structured environment

        [Ian Fisk] 11:54:35
and you are acting in a predictable way with the data that will be used for analysis. At least for us,

        [Johannes Elmsheuser] 11:54:38
        Yeah.

        [Ian Fisk] 11:54:41
we had a user browse some data that we weren't expecting, and they ran up a $75,000 export bill in a month.

        [Johannes Elmsheuser] 11:54:50
Sure. I mean, that is then about how you structure your workflows. Absolutely, I fully agree.

        [Johannes Elmsheuser] 11:54:57
So if you have an agreed workflow, and here we are showing production, that's totally clear, right? And you don't want to have surprises from some unstructured user analysis. Fully agreed.
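For scale, a quick calculation of what a bill like the one Ian mentions implies at a typical list price; the per-GB rate below is an assumption, since real prices are tiered and vary by provider and destination:

    # How much exported data corresponds to a $75,000 egress bill?
    bill = 75_000.0
    price_per_gb = 0.09                      # assumed $/GB list price
    gb = bill / price_per_gb
    print(f"{gb / 1024:.0f} TiB (~{gb / 1e6:.1f} PB) exported")

Under that assumption, a single month's unplanned browsing on the order of a petabyte is enough to produce such a bill.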

        [Enrico Fermi Institute] 11:55:14
Is there a comment from Paolo?

        [Paolo Calafiura (he)] 11:55:16
Yes, I mean, I feel I'm becoming like a broken record.

        [Paolo Calafiura (he)] 11:55:25
But once again, I think this slide shows you the benefits of committing versus taking, you know, a hands-off approach.

        [Paolo Calafiura (he)] 11:55:34
We have always said that the cloud is a great way to do

        [Paolo Calafiura (he)] 11:55:39
elastic computing, like the slide at the bottom kind of suggests: when we need something for doing analysis, we will use it.

        [Enrico Fermi Institute] 11:55:39
        Hmm.

        [Paolo Calafiura (he)] 11:55:49
And then our loads will be elastic, and that's what's expensive.

        [Paolo Calafiura (he)] 11:55:55
But of course, once again, take the point of view of the vendor. They want to lock you in, not necessarily with some evil mechanism, but just by offering you a good subscription deal, so that you take some of the money that

        [Enrico Fermi Institute] 11:55:55
        Okay.

        [Enrico Fermi Institute] 11:56:11
        Yeah.

        [Paolo Calafiura (he)] 11:56:13
you would otherwise spend on your own hardware and give it to them. And so there is a lock-in there,

        [Paolo Calafiura (he)] 11:56:19
because, of course, the price is constant for 12 months or 15 months, but it can change from one year to the next, and it will, as it should. So you are locked in, because then you don't have any more, let's say, all of your Tier 1 or

        [Paolo Calafiura (he)] 11:56:38
Tier 2 hardware, and then you are locked in with them.

        [Enrico Fermi Institute] 11:56:44
Kaushik?

        [Kaushik De] 11:56:47
Yeah, coming back to the other point. I'm sure it will be discussed tomorrow

        [Kaushik De] 11:56:55
during the dedicated session also, but since it came up: the issue of heterogeneity in the cloud. The heterogeneity is actually extremely useful and extremely good.

        [Kaushik De] 11:57:09
In the cloud we are using both Amazon and Google for studies with FPGAs, with ARM, with GPUs, and there is no cost in setting up those resources, because they're already available in the cloud. So I think there is a usefulness of highly specialized

        [Kaushik De] 11:57:41
hardware at minimal cost, because we don't pay for setting it up in the cloud.

        [Kaushik De] 11:57:47
It's already there; we can go in and use it, and that is an enormous resource for the experiments, because if we had to set up our own FPGA farm or ARM farm or GPU farm in order to do some of these studies,

        [Kaushik De] 11:58:03
it would be prohibitively expensive.

        [Ian Fisk] 11:58:07
Right, and I didn't mean to imply that there wasn't real value in the diversity of resources on the cloud.

        [Ian Fisk] 11:58:14
I was only commenting that at production scales it can become very expensive.

        [Enrico Fermi Institute] 11:58:25
A comment from Fernando?

        [Fernando Harald Barreiro Megino] 11:58:27
Yeah, it's a question, again about the egress cost.

        [Fernando Harald Barreiro Megino] 11:58:34
There is always this legend that if there is peering between, let's say, a cloud provider and,

        [Fernando Harald Barreiro Megino] 11:58:43
for example, ESnet, you can bring down the egress cost. And I wanted to ask if that's really true, or just something that we heard

        [Fernando Harald Barreiro Megino] 11:58:55
but no one really knows about.

        [Enrico Fermi Institute] 11:59:01
Okay, I think we're definitely going to have some dedicated time to talk about that on Wednesday.

        [Enrico Fermi Institute] 11:59:07
I know Dale is going to have a slide or two for us, so maybe we move that question to Wednesday specifically, unless somebody wants to jump in right now.

        [Fernando Harald Barreiro Megino] 11:59:17
        okay.

        [Enrico Fermi Institute] 11:59:21
A comment from Alexei?

        [Alexei Klimentov] 11:59:22
Okay. So my comment is related to the comments from Ian and Paolo; they were different comments.

        [Alexei Klimentov] 11:59:30
I may disagree that we use clouds as HPCs; we use clouds in completely different ways.

        [Alexei Klimentov] 11:59:40
The whole idea of trying clouds is what was written on this slide: that we can elastically scale resources,

        [Alexei Klimentov] 11:59:48
that we can have this diversity of resources, and that we can build our own architecture, at least.

        [Enrico Fermi Institute] 11:59:50
        Yeah.

        [Enrico Fermi Institute] 11:59:53
        Excellent.

        [Alexei Klimentov] 11:59:57
But compare that with what we have, especially with the LCFs:

        [Alexei Klimentov] 12:00:02
there you have boundary conditions, because this machine was built, as was mentioned correctly, not for HEP

        [Alexei Klimentov] 12:00:09
but for some other domains. And to Paolo, my colleague: with the cloud,

        [Alexei Klimentov] 12:00:19
what we have is many years of grid experience. I don't think it is the right way to mirror

        [Alexei Klimentov] 12:00:26
our understanding of commercial companies onto what we are doing with clouds right now. Certainly they want to make money,

        [Alexei Klimentov] 12:00:34
but we're not so stupid as to stop our Tier

        [Alexei Klimentov] 12:00:38
2s and use just clouds. And the whole idea of the 15-month project with Google is just to learn it better.

        [Alexei Klimentov] 12:00:47
So I think we are at a very early stage with clouds, in understanding the cost model and how it can be integrated with our grid model.

        [Enrico Fermi Institute] 12:00:50
        Okay.

        [Alexei Klimentov] 12:00:59
Those were my two comments.

      • 11:00
        NERSC presentation 20m

        NERSC update / Workshop for the USATLAS-USCMS HPC/Cloud Blueprint

        Speaker: Wahid Bhimji (Lawrence Berkeley National Lab. (US))

        [Enrico Fermi Institute] 12:01:05
So it is 11:00. We're going to have a short presentation

        [Enrico Fermi Institute] 12:01:11
from Wahid. Are you out there?

        [Wahid Bhimji] 12:01:13
        Yes. Hello!

        [Wahid Bhimji] 12:01:18
Yeah, hold on. I'll just move rooms right now.

        [Enrico Fermi Institute] 12:01:18
        yeah, okay.

        [Wahid Bhimji] 12:01:22
        I'm just gonna move into a meeting room

        [Enrico Fermi Institute] 12:01:25
So you had your workshop. Now, are you discussing NERSC-11?

        [Wahid Bhimji] 12:01:29
Just 10. Yes, we're not quite that far ahead.

        [Enrico Fermi Institute] 12:01:30
Oh, that's 10. I got the number wrong.

        [Wahid Bhimji] 12:01:36
Yeah, so it's good timing to have this conversation, actually.

        [Wahid Bhimji] 12:01:39
So yeah, I have a few slides.

        [Wahid Bhimji] 12:01:46
I don't necessarily need to talk through them.

        [Wahid Bhimji] 12:01:50
        I wasn't sure if you wanted slides or not.

        [Enrico Fermi Institute] 12:01:56
Do you want to share? Can you allow sharing, or—

        [Enrico Fermi Institute] 12:01:59
Are you allowed to share?

        [Wahid Bhimji] 12:02:02
Yeah, I think so. Well, it hasn't... hang on.

        [Enrico Fermi Institute] 12:02:03
        Oh!

        [Wahid Bhimji] 12:02:05
        I'm just

        [Enrico Fermi Institute] 12:02:06
        Great

        [Wahid Bhimji] 12:02:09
Let's see, does that work? You see some window? Yeah, let's see if slide-show mode messes it up.

        [Enrico Fermi Institute] 12:02:17
        Yes, yes.

        [Enrico Fermi Institute] 12:02:20
        Right.

        [Wahid Bhimji] 12:02:23
So, I mean, this is actually just based on some slides I showed at the CCE meeting, or Debbie showed them.

        [Wahid Bhimji] 12:02:29
So there's no particular news here; this is just to set the context,

        [Enrico Fermi Institute] 12:02:34
        Thank you.

        [Wahid Bhimji] 12:02:36
        And then we can just talk about, you know, whatever you want to talk about.

        [Wahid Bhimji] 12:02:38
I guess so. This is the current state of the NERSC systems.

        [Wahid Bhimji] 12:02:46
So this should actually say Phase 1. Now we have the full Perlmutter, both the A100-accelerated GPU nodes and CPU-only nodes,

        [Wahid Bhimji] 12:02:59
but this is still not quite in production. As was mentioned briefly earlier, we do have some file system problems in the last stage of upgrading them to use this new Slingshot high-speed interconnect; there have been a few snags, but

        [Wahid Bhimji] 12:03:15
those are being resolved, and I'd say it's probably within a month of being fully available in production

        [Wahid Bhimji] 12:03:25
mode. And as you probably know, so far it's been in an early-science, kind of free mode, where you don't have to use your allocation in order to use

        [Wahid Bhimji] 12:03:35
it. But that's coming soon. And then we still have Cori in production; that is the main production machine at the moment, and the goal is to retire it at the start of next year, pending Perlmutter actually being fully in production.

        [Wahid Bhimji] 12:03:58
And there's just a comment here that we do, you know,

        [Wahid Bhimji] 12:04:06
look at what user requirements are. In order to get increased computing resources, it is necessary to move to accelerated nodes; that's the only way we could offer the kind of increase in performance we need from this machine over the previous machine. We do recognize that many communities are not ready to use GPUs for all

        [Wahid Bhimji] 12:04:26
of their workload, and so that's why there are

        [Wahid Bhimji] 12:04:29
CPU-only nodes that actually provide all of the capability of Cori

        [Wahid Bhimji] 12:04:35
in these nodes. Okay, so that's the system.

        [Wahid Bhimji] 12:04:41
This is a bit more of where we're going

        [Wahid Bhimji] 12:04:43
once we only have Perlmutter. So there's a bit more detail on the CPU nodes here. And then, just to say, it was on the previous slide as well,

        [Wahid Bhimji] 12:04:51
but these file systems are being made available, and we do put a focus on having connections with external facilities, including other HPC centers as well as, you know, scientific user facilities.

        [Wahid Bhimji] 12:05:08
Okay. And then, we've shown this many times: we had this superfacility project, and this was really about trying to improve the engagement with data-intensive workloads that also need workflow services running alongside. So we have an infrastructure that's

        [Wahid Bhimji] 12:05:23
Kubernetes-based for services on the side. We

        [Wahid Bhimji] 12:05:27
put focus in things like Jupyter notebooks that can also run on the big machines, and we're really pushing for federated identity.

        [Wahid Bhimji] 12:05:35
That's kind of rolled out now, so you can use credentials from other places to access NERSC,

        [Wahid Bhimji] 12:05:43
assuming you have a NERSC account, so you can tie the two together. And hopefully that will be pushed out further; that's coming

        [Wahid Bhimji] 12:05:50
in the months ahead, as part of this Integrated Research Infrastructure task force, which is trying to get cooperation across different centers for these

        [Wahid Bhimji] 12:06:04
things. So that's just an example with, you know,

        [Enrico Fermi Institute] 12:06:07
        Please.

        [Wahid Bhimji] 12:06:07
a HEP-type workflow, LZ. We are the primary center for them, and the only center in the US,

        [Wahid Bhimji] 12:06:16
so they really have to have all aspects of their workflow working well at NERSC, and it takes a lot of engagement to achieve that.

        [Wahid Bhimji] 12:06:26
I guess this is saying: we engage with scientists in lots of ways.

        [Wahid Bhimji] 12:06:32
So there's the NESAP program, and ATLAS and

        [Wahid Bhimji] 12:06:35
CMS are both part of that, which can help provide resources to port to new architectures, and also to explore AI methods, which is really also a way of using GPU resources, as well

        [Wahid Bhimji] 12:06:51
as having the same benefits in terms of transformative change to the way science works. And then we also have the superfacility project that is trying to build more workflow stuff. So, on the future NERSC-10 that I'm just mentioning, we have a workshop about it now, internally.

        [Wahid Bhimji] 12:07:08
It has achieved CD-0,

        [Wahid Bhimji] 12:07:10
so that means there's a mission need for it. Now we're really putting together an RFP,

        [Wahid Bhimji] 12:07:15
which will go out to vendors to bid to provide us with the machine.

        [Wahid Bhimji] 12:07:21
So that's the stage it's at. And part of the way this has been phrased,

        [Wahid Bhimji] 12:07:25
the mission need, is that we need a machine to support workflows rather than just applications.

        [Wahid Bhimji] 12:07:32
So I think that helps the experimental HEP community as well. And then I briefly mentioned this Integrated Research Infrastructure effort:

        [Wahid Bhimji] 12:07:41
that is another DOE-wide effort to build workflow technologies and support different centers.

        [Wahid Bhimji] 12:07:51
I guess this is just the NERSC-10 mission statement here;

        [Wahid Bhimji] 12:07:56
probably there's nothing new for you there. And this is just saying again that we expect this machine to really stretch out into ESnet and other places, and provide a way people can run stuff using data from outside.

        [Wahid Bhimji] 12:08:13
Then I just briefly wanted to mention these... yes, sure.

        [Enrico Fermi Institute] 12:08:16
A quick question on that slide. So that means essentially streaming,

        [Enrico Fermi Institute] 12:08:23
both streaming in and streaming out?

        [Wahid Bhimji] 12:08:25
Yes. So that comment was made earlier, and there are various use cases, not just HEP, who want to do that, including the light sources, like you mentioned.

        [Wahid Bhimji] 12:08:37
        so we do anticipate supporting that better in principle.

        [Wahid Bhimji] 12:08:42
It should already be much better on Perlmutter than it was on Cori.

        [Wahid Bhimji] 12:08:44
I mean, you know the problems we've had on Cori, which really have never been properly resolved. Perlmutter already

        [Enrico Fermi Institute] 12:08:46
        Okay.

        [Wahid Bhimji] 12:08:53
should have better capabilities to do this.

        [Enrico Fermi Institute] 12:08:59
        Okay, Great: Thanks.

        [Wahid Bhimji] 12:09:02
Okay, this is just a couple of slides of context as well. You know, the landscape as a whole is getting increasingly challenging with heterogeneity. In some ways there may be advances from this: this is the NVIDIA Grace Hopper architecture, which has CPUs and

        [Wahid Bhimji] 12:09:20
GPUs with, you know, shared access to memory.

        [Enrico Fermi Institute] 12:09:22
        Yeah.

        [Wahid Bhimji] 12:09:26
So in some sense this could reduce data-movement costs and make this easier to program than current architectures.

        [Wahid Bhimji] 12:09:36
But on the other hand, this Grace is an ARM CPU, so there are already some differences

        [Wahid Bhimji] 12:09:42
there. And then there's also this move to chiplets,

        [Wahid Bhimji] 12:09:48
AMD, for example, having all kinds of different cores on there;

        [Wahid Bhimji] 12:09:50
there are DPUs, so programmable networking.

        [Wahid Bhimji] 12:09:54
And then there's all this AI-specific hardware. And then, a bit longer term, there's the idea of processing in storage, and there's also a move we see on the NERSC-10 timeframe coming in towards disaggregation, which

        [Wahid Bhimji] 12:10:09
potentially allows more efficient use of resources. This is the idea that you could have a disaggregated memory pool, which gives you increased memory capacity, but not on the node. You would be incorporating memory from outside the node, but that means that people who need

        [Wahid Bhimji] 12:10:27
much higher memory capacity would actually be able to access it without us having to buy that much in every single node. So there are opportunities here,

        [Wahid Bhimji] 12:10:37
but also quite a complex landscape. And then, you know, there's this rise of the cloud market that really is driving everything. This is an opportunity, of course, because we can capitalize on all this investment going into cloud interfaces and so forth, but it means

        [Wahid Bhimji] 12:10:56
that we also have to recognize that in the kind of machines that we have access to, and we can expect that these interfaces will become the standard way of accessing machines.

        [Wahid Bhimji] 12:11:11
So this is also good. I think it means that if you use these cloud interfaces, then there's probably a good expectation that these should be supported.

        [Wahid Bhimji] 12:11:25
We should definitely work with the other compute centers to make sure these are well supported at the various compute centers.

        [Wahid Bhimji] 12:11:34
And this is just one slide on... I mean, since this was the HPC

        [Wahid Bhimji] 12:11:38
and cloud workshop, I just thought I'd add this. You know,

        [Enrico Fermi Institute] 12:11:39
        Yes.

        [Wahid Bhimji] 12:11:41
we're already kind of using cloud technologies there, as mentioned, in the Spin services that sit on the side,

        [Wahid Bhimji] 12:11:46
but we're increasingly seeing a tighter integration with the main system.

        [Wahid Bhimji] 12:11:51
And so I expect on NERSC-10 there'll be an increasing ability to use cloud-type interfaces to access the big supercomputing resources as well.

        [Enrico Fermi Institute] 12:12:04
        Okay.

        [Wahid Bhimji] 12:12:04
Okay. So I think that's almost all I really had.

        [Wahid Bhimji] 12:12:10
This one is just about data management. I think we see that also having an increased role in the NERSC-10 timeframe, which I think should also help this community. But again, and this is probably a general point I thought of as the discussion was going on earlier, we do have to cater to a very

        [Wahid Bhimji] 12:12:29
wide community. So that's maybe one of the disadvantages we have compared to the leadership computing facilities: we do try to support different user communities,

        [Wahid Bhimji] 12:12:39
but we have, you know, thousands of users and several hundred projects that have different needs.

        [Wahid Bhimji] 12:12:46
Some of them are traditional HPC-center projects, so they need tightly coupled, large-scale resources;

        [Wahid Bhimji] 12:12:54
some are more similar to experimental HEP, but have their own

        [Wahid Bhimji] 12:13:00
ways of doing things, a little bit different from how experimental HEP does it. And so we have to come to some sort of balance in supporting all of these.

        [Wahid Bhimji] 12:13:12
Okay, I think that's me. Any questions?

        [Enrico Fermi Institute] 12:13:18
Thanks, Wahid. I have one question. I think you mentioned that NERSC-10 is going to have a lot of, you know, accelerators for performance and things like that.

        [Enrico Fermi Institute] 12:13:29
Do you guys have any feeling for what the mix will be

        [Enrico Fermi Institute] 12:13:34
of accelerators and CPUs in the next machine?

        [Wahid Bhimji] 12:13:39
Well, we don't, and we're having that discussion. So, one thing, and these other things might come into play here as well: I think you can guarantee that there will be some GPUs in this machine, pretty

        [Enrico Fermi Institute] 12:13:41
        Okay.

        [Enrico Fermi Institute] 12:13:41
        Yeah.

        [Wahid Bhimji] 12:13:54
much. Realistically, that will be the most likely generally usable accelerator;

        [Wahid Bhimji] 12:13:59
that's right today. Then, I mentioned there are these disaggregation technologies, and also several of the vendors are talking about multi-tenancy and so forth.

        [Wahid Bhimji] 12:14:10
So it is possible that one could run the CPU-only workload alongside, without any dedicated CPU-only nodes. That would be a judgment on whether that technology really allows it, and whether it would provide sufficient resources:

        [Wahid Bhimji] 12:14:31
whether codes that are super GPU-heavy and accelerated would leave enough of the CPU to allow other, CPU-only jobs to run there. But anyway, it's clear that a certain part of the community, even on the 2026 timescale, won't be ready

        [Wahid Bhimji] 12:14:50
for accelerated-only, so there will continue to be some CPU resource. And then, on the more exotic accelerators, I think it is likely that we will, in the RFP, have some place where vendors can pitch AI

        [Enrico Fermi Institute] 12:14:52
        Okay.

        [Wahid Bhimji] 12:15:10
accelerators, for example. Whether those are offering a significant benefit above GPUs, I don't think is yet clear.

        [Wahid Bhimji] 12:15:20
At the minute I don't think they particularly are, but they may do

        [Wahid Bhimji] 12:15:24
on the 2026 timescale. But the AI workload is currently not a very big fraction of what we're running, and so it would have to be sized accordingly. And I would say, on

        [Wahid Bhimji] 12:15:38
the integration with cloud, we're also looking at that. As the point was made earlier, there's a huge variety of technology on the cloud, and even though we try to deploy cutting-edge technology, they're obviously quicker to deploy various new technologies. So it

        [Wahid Bhimji] 12:15:53
may be that we can partner with cloud providers to provide some of this capability for experiments with particular workloads that need to run on different accelerators.

        [Enrico Fermi Institute] 12:16:11
Wahid, would it be fair to say that we shouldn't expect a significant scale-up of the CPU? Because if I look at Cori to Perlmutter, the CPU basically stayed pretty much flat, more or less, because the CPU fraction of Perlmutter is somewhat equivalent in performance

        [Wahid Bhimji] 12:16:21
        Hmm.

        [Enrico Fermi Institute] 12:16:30
to what we had on Cori. And just for power-budget reasons, I wouldn't expect that NERSC-10 gives us three times the CPU.

        [Enrico Fermi Institute] 12:16:40
Is that a fair assumption? Probably, yeah. Okay.

        [Wahid Bhimji] 12:16:41
        right.

        [Wahid Bhimji] 12:16:42
Right. Yeah, I think, in terms of CPU-only resources,

        [Wahid Bhimji] 12:16:47
that would be a reasonable expectation.

        [Enrico Fermi Institute] 12:16:50
        Good.

        [Enrico Fermi Institute] 12:16:58
Other questions for Wahid? I have a more short-term technical one.

        [Enrico Fermi Institute] 12:17:06
So for transferring data out, Globus is not the be-all and end-all

        [Enrico Fermi Institute] 12:17:10
for LHC. I know that there was some work to do something with XRootD.

        [Wahid Bhimji] 12:17:17
        Yeah, So that's still ongoing. I mean.

        [Enrico Fermi Institute] 12:17:20
How's that going?

        [Wahid Bhimji] 12:17:23
Well, I mean, we're still working on it, right?

        [Wahid Bhimji] 12:17:28
It's got a bit slower now, but I think we are trying to do that, and I think, particularly if both ATLAS and CMS can use the same interface, and also other

        [Wahid Bhimji] 12:17:39
HEP experiments, and even potentially the light sources,

        [Wahid Bhimji] 12:17:42
then it's something worth us putting effort into supporting. I also think we need to—

        [Wahid Bhimji] 12:17:49
so at the moment, Spin, these containerized services, hasn't been optimized for data-management services, but I think that's another thing that we should be able to support in the longer run, which would allow people to run all kinds of different things on that side. I mean, Globus

        [Wahid Bhimji] 12:18:08
is, for us, the interface supported by the largest number of other communities, so it's really worth us putting effort into supporting it.

        [Wahid Bhimji] 12:18:20
But yeah, I do appreciate that not everyone uses it,

        [Wahid Bhimji] 12:18:23
and so we do need other things. I did have a brief chat—

        [Wahid Bhimji] 12:18:26
I saw Ian Foster at a conference a couple of weeks ago, and so I did have a brief chat with him

        [Wahid Bhimji] 12:18:34
about ways we can maybe improve Globus interoperation,

        [Wahid Bhimji] 12:18:43
but that was no more than a chat at this point. He seemed open to more discussions on that front.

        [Enrico Fermi Institute] 12:18:52
It's probably not for this talk; we can chat later on that.

        [Enrico Fermi Institute] 12:18:59
Technically, we were stuck on a few things, but we can talk it over

        [Enrico Fermi Institute] 12:19:07
when there's time. Okay, other questions, anybody else?

        [Enrico Fermi Institute] 12:19:16
        Anybody on zoom

        [Enrico Fermi Institute] 12:19:24
By the way, we had this planned for the afternoon, for the HPC focus area, but due to the ongoing workshop there was a little bit of a scheduling conflict here.

        [Enrico Fermi Institute] 12:19:35
So we—okay, alright.

        [Wahid Bhimji] 12:19:35
Yeah, I won't be around in the afternoon, so if you want to ask me anything, you should do it now. But yeah, we'd be interested in also seeing the blueprint

        [Enrico Fermi Institute] 12:19:42
        considering.

        [Wahid Bhimji] 12:19:45
as well, once you have it, or whatever, because I think that will help, you know, as was mentioned.

        [Enrico Fermi Institute] 12:19:49
I mean, it's probably not going to be fully public, but there might be a version of it that's going to be public.

        [Wahid Bhimji] 12:19:57
Right, yeah. I mean, again, for influencing kind of architectural decisions—

        [Enrico Fermi Institute] 12:19:57
        We'll have to see what

        [Wahid Bhimji] 12:20:02
it's really when we're evaluating the RFP

        [Wahid Bhimji] 12:20:05
and stuff that we can bring in these considerations.

        [Enrico Fermi Institute] 12:20:09
So are you looking at things like the very low-power cores, like ARM?

        [Wahid Bhimji] 12:20:14
Yeah, I mean, NVIDIA wants to sell you this Grace Hopper architecture now.

        [Wahid Bhimji] 12:20:22
So they're selling ARM CPUs with the GPU.

        [Wahid Bhimji] 12:20:28
So, at least for the GPU-accelerated nodes,

        [Wahid Bhimji] 12:20:31
if they're NVIDIA, then they would be ARM. And they also sell CPU-only, or will do so.

        [Enrico Fermi Institute] 12:20:39
        That's which


      • 11:20
        Resource Acquisition and Operations 20m

        [Enrico Fermi Institute] 12:21:19
        So slides again. I need to share them.

        [Enrico Fermi Institute] 12:21:32
        Do you want to do these? Or here? I can go through them.

        [Enrico Fermi Institute] 12:21:37
So one of the questions on the charge was what metrics should be used to decide whether a workflow is executed efficiently, both in how we acquire the resources and also in how we operate the workflows: is it efficient to get a certain resource, i.e., the effort spent to get

        [Enrico Fermi Institute] 12:22:01
it, the cost to get it, and then actually to run our workflows on it? Acquiring in this context means two things. One is to actually get access to the resources, which for HPC means competitive proposals, where you put

        [Enrico Fermi Institute] 12:22:22
in, usually at the moment, yearly proposals.

        [Enrico Fermi Institute] 12:22:27
You have to follow a certain procedure; every HPC

        [Enrico Fermi Institute] 12:22:29
facility is different. XSEDE, now ACCESS, is like an umbrella organization where you can ask for time on multiple facilities

        [Enrico Fermi Institute] 12:22:37
in one proposal, but others are unique to

        [Enrico Fermi Institute] 12:22:43
one facility. And on cloud, either you just pay as you go, paying whatever the price is, on demand, or spot, or preemptible,

        [Enrico Fermi Institute] 12:22:53
or whatever the instance type is called, with publicly available

        [Enrico Fermi Institute] 12:22:55
pricing for everyone: if you show up with a credit card, you can get it. Or, like what ATLAS is doing right now,

        [Enrico Fermi Institute] 12:23:04
a subscription based on a negotiation: basically, we commit to a certain amount of money, and we get a certain block of resources, with limitations and rules on how we can use them. And the second part of acquiring is the actual provisioning: once someone gives you

        [Enrico Fermi Institute] 12:23:23
the keys, basically, "here are the resources," you actually have to figure out

        [Enrico Fermi Institute] 12:23:27
how to tie them into our systems so that you can make use of them.

        [Enrico Fermi Institute] 12:23:34
So at the HPC level, it's things like batch queues, the unit of provisioning, the number of nodes, scheduler policies; all of that comes into play, because it's all different from what we are used to on our own resources

        [Enrico Fermi Institute] 12:23:51
that we own, where we have a fixed quota. We say: you get 4,000 cores.

        [Enrico Fermi Institute] 12:23:55
Okay, there might be a 24-hour wait while the resources drain back from other people,

        [Enrico Fermi Institute] 12:24:01
but eventually, if you provide a stable, basically sufficient amount of work, it will always give you 4,000 cores.

        [Enrico Fermi Institute] 12:24:09
That's different on the HPCs: you don't have any guarantees there.

        [Enrico Fermi Institute] 12:24:14
And then cloud is less problematic in terms of provisioning, because you pay your money. But depending on what pricing model and what rules you follow,

        [Enrico Fermi Institute] 12:24:26
you can still have to deal with contention, with saturation

        [Enrico Fermi Institute] 12:24:29
in certain regions, which depends on the size of the region,

        [Enrico Fermi Institute] 12:24:35
the activity of other customers, what instance types you request, and so on.

        [Enrico Fermi Institute] 12:24:40
And then, once you have the resources and they are available, and you've provisioned them and they're integrated,

        [Enrico Fermi Institute] 12:24:49
you look at what metrics are interesting to determine whether you actually operate efficiently.

        [Enrico Fermi Institute] 12:24:56
The standard one we use everywhere is CPU efficiency. GPU efficiency

        [Enrico Fermi Institute] 12:25:05
is basically nothing; it's an open question. We don't have anything that measures how efficiently we use the GPU.

        [Enrico Fermi Institute] 12:25:14
On the cloud, what it eventually comes down to is the dollars per event, or the dollars paid per HS06-hour

        [Enrico Fermi Institute] 12:25:21
you get. On HPC, there's no direct

        [Enrico Fermi Institute] 12:25:27
cost associated, so the outlay is zero in monetary terms, but of course it's not free in effort.
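A minimal sketch of the two cost metrics just named; all the input numbers are made-up placeholders, including the HS06-per-core factor:

    # Dollars per event and dollars per HS06-hour, the two cloud cost
    # metrics mentioned above. All inputs are illustrative.
    def dollars_per_event(total_cost, events):
        return total_cost / events

    def dollars_per_hs06_hour(total_cost, core_hours, hs06_per_core):
        return total_cost / (core_hours * hs06_per_core)

    print(dollars_per_event(36_000.0, 90_000_000))          # 0.0004 $/event
    print(dollars_per_hs06_hour(36_000.0, 900_000, 10.0))   # 0.004 $/HS06-h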

        [Enrico Fermi Institute] 12:25:36
Then you look at overall utilization. If you have a certain number of cloud credits, or a certain allocation size, are you using them up? You spent some effort to get them, so you should use them. Take the subscription

        [Enrico Fermi Institute] 12:25:52
model: if Google lets you use 10,000 cores as part of the subscription, it doesn't make sense to use only 1,000.

        [Enrico Fermi Institute] 12:26:00
There is no benefit, only a penalty, in not using your full share.

        [Enrico Fermi Institute] 12:26:04
The other thing is turnaround time. By turnaround time I mean provisioning

        [Enrico Fermi Institute] 12:26:13
turnaround. This comes in especially for HPC, if you talk about the LCFs: associated with the unit of provisioning there is also

        [Enrico Fermi Institute] 12:26:26
latency, which varies. That's different from what we are used to in our normal grid operations:

        [Enrico Fermi Institute] 12:26:33
at an LCF, if you ask for 1,000 nodes, you can wait, and you have no idea when you're going to get them.

        [Enrico Fermi Institute] 12:26:39
Eventually you'll get them; it's not under your control.
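One way to make this turnaround a concrete metric is simply to timestamp requests and node starts and keep the distribution rather than a single average. A sketch, where the names and structure are our own choices:

    # Track provisioning turnaround: submit time vs. node start time.
    import statistics, time

    pending = {}          # request id -> submission timestamp
    latencies = []        # completed request latencies, in seconds

    def submitted(req_id):
        pending[req_id] = time.time()

    def started(req_id):
        latencies.append(time.time() - pending.pop(req_id))

    def summary():
        """Median = typical wait; 95th percentile = what the software
        stack's provisioning assumptions must tolerate. Requires samples."""
        lat = sorted(latencies)
        return statistics.median(lat), lat[int(0.95 * (len(lat) - 1))]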

        [Enrico Fermi Institute] 12:26:44
And for all these metrics: how are we gathering them?

        [Enrico Fermi Institute] 12:26:48
On our own resources, we have services in place; we have many years of preparation, so we just have to

        [Enrico Fermi Institute] 12:26:56
get them to forward information and collect it. HPC and

        [Enrico Fermi Institute] 12:27:00
cloud are different. On the cloud you can run whatever you want,

        [Enrico Fermi Institute] 12:27:04
but HPC is problematic: you need to collect statistics from the batch system, from the job system, and so on,

        [Enrico Fermi Institute] 12:27:13
and figure out how you forward them, so they're actually collected in the right place and you can compare them.

        [Enrico Fermi Institute] 12:27:19
We've got a question. Yep.

        [Enrico Fermi Institute] 12:27:24
        yes.

        [Ian Fisk] 12:27:25
Yeah, I had a question, which was about the comment that you have nothing for GPU efficiency.

        [Ian Fisk] 12:27:31
If you have nothing for GPU efficiency, it's just that you haven't asked. The GPUs themselves monitor

        [Ian Fisk] 12:27:38
their utilization very well. The command is nvidia-smi; it will tell you how much of the memory and how much of the theoretical processing capacity you're using.
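The query interface Ian refers to can be scripted directly. A sketch of sampling it periodically: the nvidia-smi query flags are standard options, while the sampling loop and field handling are our own choices:

    # Sample GPU utilization via nvidia-smi and print it; in practice the
    # samples would be forwarded to the experiment's monitoring system.
    import subprocess, time

    QUERY = "utilization.gpu,utilization.memory,memory.used,memory.total"

    def sample():
        out = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=" + QUERY,
             "--format=csv,noheader,nounits"], text=True)
        fields = QUERY.split(",")
        return [dict(zip(fields, map(float, line.split(", "))))
                for line in out.strip().splitlines()]

    while True:                      # runs until the job wrapper stops it
        for gpu_id, stats in enumerate(sample()):
            print(gpu_id, stats)
        time.sleep(60)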

        [Enrico Fermi Institute] 12:27:49
No, I'm not saying there is nothing you can run on a GPU to tell you to what degree it's utilized.

        [Enrico Fermi Institute] 12:27:56
What I'm saying is, I don't think we have anything where, with our GPU workflows, we actually record this information and keep track of it.

        [Ian Fisk] 12:27:56
        Okay.

        [Ian Fisk] 12:28:06
Okay. But at this point, in the same way that you record the CPU efficiency with top, there are tools that do exactly the same thing for GPU.

        [Enrico Fermi Institute] 12:28:18
        And

        [Enrico Fermi Institute] 12:28:18
Yes, it just needs to be put in place and fed into the monitoring. It's more an indicator of how early we are in terms of adoption of GPU workflows in the experiments

        [Ian Fisk] 12:28:19
        Okay.

        [Enrico Fermi Institute] 12:28:33
than it is an indicator of a lack of low-level tools.

        [Enrico Fermi Institute] 12:28:36
All of the tools are there; it's just a matter of threading it all through.

        [Ian Fisk] 12:28:39
Yeah. I can tell you,

        [Ian Fisk] 12:28:43
I send a lot of email every week about people who are not using the GPUs especially

        [Ian Fisk] 12:28:47
well. And so it's probably something that should go into the monitoring system early on, because

        [Ian Fisk] 12:28:55
it's not like

        [Ian Fisk] 12:28:56
it's hard to get.

[Enrico Fermi Institute] 12:28:59
One thing that conceptually is not quite as mature is what these different numbers mean when

        [Enrico Fermi Institute] 12:29:07
you're comparing across sites.

        [Enrico Fermi Institute] 12:29:12
How do you compare a 1080 versus an A100? Although maybe in that example you just round the 1080 down to zero and the problem is solved.

        [Enrico Fermi Institute] 12:29:22
But trying to aggregate and cross-compare is hard.

        [Enrico Fermi Institute] 12:29:29
I mean, eventually one of the things you're asking here is: am I getting my money's worth?

        [Enrico Fermi Institute] 12:29:36
And that will be asked from many different directions, including from the sites.

        [Enrico Fermi Institute] 12:29:41
Once you start to do accounting, these sorts of things aren't as established as what we have for CPU.

        [Ian Fisk] 12:29:45
Right, but—

        [Ian Fisk] 12:29:52
Right, but one of the reasons why we have HS06

        [Ian Fisk] 12:29:57
was that we had a variety of CPUs, weren't sure what the relative performance was going to be between them, and used this benchmark to figure out the relative capacity of each of the sites. It's not intrinsically more difficult than that; there's just a much

        [Ian Fisk] 12:30:12
        wider variation in the performance of Gpus

        [Enrico Fermi Institute] 12:30:17
We need an HS06 for GPUs, maybe.

        [Ian Fisk] 12:30:20
Maybe you do. But yeah.
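Hypothetically, an "HS06 for GPUs" would work the same way HS06 does for CPUs: a per-model benchmark score used to normalize delivered hours. A sketch, where the scores are invented placeholders, not measurements:

    # Normalize raw GPU-hours by an assumed per-model benchmark score.
    GPU_SCORE = {"GTX 1080": 1.0, "V100": 4.0, "A100": 10.0}   # placeholders

    def normalized_gpu_hours(usage):
        """usage: dict mapping GPU model -> raw GPU-hours delivered."""
        return sum(hours * GPU_SCORE[model] for model, hours in usage.items())

    print(normalized_gpu_hours({"GTX 1080": 1000, "A100": 200}))  # 3000.0

As the discussion notes, the hard part is not the arithmetic but agreeing on scores that stay meaningful across GPU generations with very different performance profiles.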

        [Enrico Fermi Institute] 12:30:25
Another thing is, once you collect this, it's not clear to me that we know what it means.

        [Enrico Fermi Institute] 12:30:29
With CPU efficiency, we kind of have an idea of what's bad.

        [Enrico Fermi Institute] 12:30:34
Usually it's bad when we get dinged by the review boards that look at our CPU efficiency and tell us we make bad use of the resources. It's been unclear to me, on the GPU side, what bad is. — I would say we don't have any clue

        [Enrico Fermi Institute] 12:30:50
on the CPU side either, but we pretend we do. We have about the same amount of clue... we have even less, because from architecture generation to architecture generation

        [Ian Fisk] 12:30:59
        right.

        [Enrico Fermi Institute] 12:31:05
of GPUs, things are changing pretty wildly, and the performance profile is very, very different for, say, a Turing-class chip versus... — Okay, so the conclusion is we need to learn to pretend to know what we're doing. — Exactly; we need to come up with sufficiently

        [Enrico Fermi Institute] 12:31:22
obfuscated language, so that we can sound like we know what we're talking

        [Enrico Fermi Institute] 12:31:25
about. — The fact that it's 2022 and our review boards

        [Enrico Fermi Institute] 12:31:31
don't understand hyperthreading suggests that by 2040, maybe, we'll have that. Okay.

        [Enrico Fermi Institute] 12:31:46
So, I guess — sorry — the one thing I wanted to say is that, in terms of even coming up with performance benchmarks,

        [Enrico Fermi Institute] 12:31:55
I don't know if it makes a lot of sense to compare how you're doing with a bunch of Turings versus how you're doing with a bunch of Amperes, or A100s, or whatever, just because the way those processors work is so different. — Yeah, that was

        [Enrico Fermi Institute] 12:32:10
my point. Sorry, we agree. Great, that's awesome.

        [Enrico Fermi Institute] 12:32:15
You don't need to tell me, just show the next slide. But particularly when it comes to acquisition, at some point you have to feed back to the powers that be how you spent the money, whether it was funny money or real money, and that starts to get into things like pledging and, you know, actually having

        [Enrico Fermi Institute] 12:32:35
these resources effectively acknowledged by the experiments.

        [Enrico Fermi Institute] 12:32:42
Are we going to touch that third rail today? — There's an accounting and pledging discussion on Wednesday morning, and benchmarking is part of that. Because, as you said, the thing with HPC is: as long as it's opportunistic,

        [Enrico Fermi Institute] 12:32:57
no one cares, it's free resources. When people actually start pledging it,

        [Enrico Fermi Institute] 12:33:02
then you get into comparing performance numbers, and whether you are meeting your pledge or not, and then things like measuring these things correctly, or at least measuring them in a way that gives you a defensible number. — Okay. One interesting

        [Enrico Fermi Institute] 12:33:23
thing I observed on the WLCG side this month is that some sites in Europe are saying: we can't keep

        [Enrico Fermi Institute] 12:33:33
the CPUs running over the winter, but we'd still like to have the same number of hours delivered, which of course is not what we pledge on. So I think there's going to be more interest in examining some alternate models where there wasn't interest

        [Enrico Fermi Institute] 12:33:51
before. But I think we really need to push

        [Enrico Fermi Institute] 12:33:56
having things like HPCs quote-unquote count. Right now,

        [Enrico Fermi Institute] 12:34:03
the value delivered officially to the experiments is rounded to zero, even though we know from the resource graphs that these have delivered lots, months' worth of events. That's going to break at some point, and the fact that some of the traditional WLCG

        [Enrico Fermi Institute] 12:34:25
sites are also hitting the brakes on the old pledge models might be an opening to turn it into an option. — Yeah. In 2021, for the CRSG, we basically added up what we actually delivered in 2021:

        [Enrico Fermi Institute] 12:34:44
US HPC was slightly above Fermilab. Now, okay, the normalization

        [Enrico Fermi Institute] 12:34:49
factors have large error bars, so it was basically comparable.

        [Enrico Fermi Institute] 12:34:53
But again, right now, viewed from some angles, you're saying:

        [Enrico Fermi Institute] 12:35:01
we delivered as much as Fermilab did, but then the value was written down to zero, because none of it officially counts.

        [Enrico Fermi Institute] 12:35:09
And that's a problem, and the problem only gets bigger. — Yeah, a question about the turnaround.

        [Enrico Fermi Institute] 12:35:16
So, on the turnaround time: some HPC centers allow you to make reservations, where you plan ahead.

        [Enrico Fermi Institute] 12:35:26
Does that change some of these metrics, and does it also

        [Enrico Fermi Institute] 12:35:35
simplify things operationally? — I can just tell you the experience we had.

        [Enrico Fermi Institute] 12:35:44
We haven't used reservations for CMS, mainly because the type of work we're sending always works anyway, so we don't really care when it runs.

        [Enrico Fermi Institute] 12:35:57
I know that for some of the neutrino and other science experiments, where they had a big specific production that they targeted at an HPC and had scheduled, they planned ahead; it makes perfect sense to do a reservation in that scenario. For us, I don't see that it

        [Enrico Fermi Institute] 12:36:14
would help as much, because the turnaround time is not so much a problem in terms of

        [Enrico Fermi Institute] 12:36:24
basically not being able to plan work, because for most of our work we don't care if it runs this week or next week, or a couple of weeks later.

        [Enrico Fermi Institute] 12:36:31
I mean, there is high-priority stuff, but we usually submit it right away and then play with the prioritization.

        [Enrico Fermi Institute] 12:36:37
The turnaround time is actually more an issue for us in our software stack, because the system is just not designed with a two-week, or even one-week, provisioning time in its assumptions. So this is more a software

        [Enrico Fermi Institute] 12:36:53
problem than an actual work-planning problem.

        [Enrico Fermi Institute] 12:37:00
It's still a useful metric to have, though, right?

        [Enrico Fermi Institute] 12:37:05
I mean, if provisioning needs a week, you cannot put high-priority stuff

        [Enrico Fermi Institute] 12:37:08
there. — That's a relatively small limitation, because most of our work is not high priority. Whether these things change that is a different issue.

        [Enrico Fermi Institute] 12:37:19
But most of the work is just: get it done. We come back a month later and check that everything is done.

        [Enrico Fermi Institute] 12:37:30
        Okay.

        [Enrico Fermi Institute] 12:37:34
Other comments or questions from Zoom? Okay. Are there any questions that we didn't ask that we should be asking?

        [Enrico Fermi Institute] 12:37:51
I mean, just to beat the dead horse again: I think the accounting and reporting of resources has to be a top-level item.

        [Enrico Fermi Institute] 12:38:07
That's got to be a priority. Not a particularly interesting technical topic, maybe.

        [Enrico Fermi Institute] 12:38:18
But that's okay. I mean, we did talk a lot about this in the context of HPC. Were there any specific comments folks wanted to make about this for cloud?

        [Enrico Fermi Institute] 12:38:33
We'll save some of that for the discussion tomorrow.

        [simonecampana] 12:38:36
Sorry, can I ask a question? It's following up on what Brian said.

        [Enrico Fermi Institute] 12:38:38
        Yep.

        [simonecampana] 12:38:43
I think it would be interesting, in fact, if those resources that today are a bit special (they might not be in the future) could be accounted properly, which means basically being reported back

        [Enrico Fermi Institute] 12:38:45
        Okay.

        [simonecampana] 12:38:57
through the official accounting tools we use. Do you understand the problem?

        [simonecampana] 12:39:02
What I didn't get from the discussion is: is the problem technical?

        [simonecampana] 12:39:07
Or is it well understood how to do it, and someone just has to do the work? Because there are various ways, for example, of integrating HPCs:

        [simonecampana] 12:39:16
if you use an engine like HEPCloud, you can put some of the intelligence there and report upstream

        [simonecampana] 12:39:24
your accounting records. But if you have something like a direct integration of the HPC

        [simonecampana] 12:39:29
with the workload management system of the experiment, like, for example, the way ATLAS is doing it, you don't have that gateway.

        [simonecampana] 12:39:39
You don't have that service; you need, in practice, PanDA, or whatever the workload management system

        [simonecampana] 12:39:46
is, to report upstream. So I think it's a good idea to look into that.

        [simonecampana] 12:39:51
Do you have a view of how to do it?

        [Enrico Fermi Institute] 12:40:01
In terms of the technical pieces, I'm not so worried.

        [Enrico Fermi Institute] 12:40:04
We've done this several times across multiple generations of technology.

        [Enrico Fermi Institute] 12:40:09
So it's not like it's the first time we've had to do an integration like that in the last five years.

        [Enrico Fermi Institute] 12:40:15
Again, my worry is: if we come in and say, you know, Oak Ridge delivered 100 million CPU hours to ATLAS,

        [Enrico Fermi Institute] 12:40:29
how does that get counted as part of a delivered resource to the experiment? How do we formalize this? Does that help meet the US's commitments to the WLCG? Because right

        [Enrico Fermi Institute] 12:40:46
now, it's a very different thing: saying that,

        [Enrico Fermi Institute] 12:40:48
okay, a resource counts toward the WLCG, and

        [Enrico Fermi Institute] 12:40:53
making that count in the official accounting (we haven't touched that in two decades) versus the technical mechanism to get an integration for reporting.

        [simonecampana] 12:40:59
        okay.

        [Enrico Fermi Institute] 12:41:04
Yeah, we seem to reinvent that every five years or so.

        [simonecampana] 12:41:08
No, I see. So it's basically a point about policy you are making, which is a good one.

        [Enrico Fermi Institute] 12:41:11
Got it.

        [simonecampana] 12:41:14
It has to do a bit with what the experiment considers a pledged resource, and a lot of the experiments consider a pledged resource something that they can use to run any workflow in a transparent way. So I think, as long as one goes in this

        [simonecampana] 12:41:29
direction, you will get the buy-in from the experiment.

        [simonecampana] 12:41:33
        Otherwise there might be some discussions to have

        [Enrico Fermi Institute] 12:41:37
Oh, I think there has to be some discussion, because I don't think any resource can run everything, with maybe the exception of the cloud.

        [Enrico Fermi Institute] 12:41:46
And even then, Lancium is a good counterexample, where it probably can't run everything.

        [Enrico Fermi Institute] 12:41:52
But to say that, yeah, the experiment got 1 billion CPU hours (again, just making up numbers) and it's worth nothing because we can't run everything on there, I think, is pretty short-sighted. But it is a very important discussion, and you know, what I find is that policies that are

        [Enrico Fermi Institute] 12:42:14
older tend to be harder to update. The fact that we haven't really dug into this in 20 years means it's going to take some effort to come to a place where everybody is happy and feels that their concerns are heard.

        [simonecampana] 12:42:34
Yeah, I see your point. I think part of the problem

        [simonecampana] 12:42:39
is that there is a large spectrum, right? There are HPCs that can be used for almost everything, which you could call sort of a pledged resource.

        [simonecampana] 12:42:46
There are HPCs that can be used to run only one generator.

        [Enrico Fermi Institute] 12:42:50
        Okay.

        [simonecampana] 12:42:51
It's a bit short-sighted to say that those are like any other facility.

        [simonecampana] 12:42:56
So, and because the spectrum is broad, it's difficult to

        [Enrico Fermi Institute] 12:42:59
        Please.

        [simonecampana] 12:42:59
I agree with you, it's something that
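
To make the "reporting upstream" thread above concrete, here is a minimal sketch of the kind of aggregation a gateway service (or the workload management system itself) could perform before publishing to the official accounting. The field names, numbers, and publish step are hypothetical stand-ins, not a real APEL/WLCG schema.

```python
# Minimal sketch of "reporting upstream": summarize completed HPC jobs into
# per-site usage records for the official accounting. The record fields and
# the publish step are hypothetical, not a real APEL/WLCG schema.
from collections import defaultdict

def summarize(jobs):
    """jobs: iterable of dicts with 'site', 'cores', 'wall_hours'."""
    totals = defaultdict(float)
    for j in jobs:
        totals[j["site"]] += j["cores"] * j["wall_hours"]  # core-hours
    return [{"site": s, "core_hours": h} for s, h in totals.items()]

completed = [
    {"site": "OLCF-Summit", "cores": 42, "wall_hours": 6.0},      # made-up numbers
    {"site": "NERSC-Perlmutter", "cores": 64, "wall_hours": 2.5},  # made-up numbers
]

for record in summarize(completed):
    # A real system would sign and publish this to an accounting service;
    # here we just print the record.
    print(record)
```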

        [Enrico Fermi Institute] 12:43:08
Going back to acquiring resources, specifically for HPC: right now, you mentioned that for HPC

        [Enrico Fermi Institute] 12:43:18
there are a couple of different kinds of proposals and submission types, whether it's a leadership-class or a user facility. These still require proposals, which are evaluated each year, and so forth. And to me it seems like you

        [Enrico Fermi Institute] 12:43:39
        can't tie that to any sort of pledge situation.

        [Enrico Fermi Institute] 12:43:43
If you've got a proposal that has to be approved by some outside scientific committee, right?

        [Enrico Fermi Institute] 12:43:51
There was something that came up recently; I'm not current on this.

        [Enrico Fermi Institute] 12:43:58
But something came up: Lis mentioned that there were high-level discussions within the agencies about better support for data sciences.

        [Enrico Fermi Institute] 12:44:08
That was one of the areas of discussion. She didn't say anything about

        [Enrico Fermi Institute] 12:44:12
specifics; at least there are discussions. I don't know to what extent that will go anywhere.

        [Enrico Fermi Institute] 12:44:19
One thing that is nice with the NERSC proposal process:

        [Enrico Fermi Institute] 12:44:24
they've started asking, over the last couple of years, about special needs and a multi-year planning horizon.

        [Enrico Fermi Institute] 12:44:31
And it seems this year they already know what they're going to give us for the next three years. That's what the feedback sounded like; we still have to write a proposal, but they asked us to write a simple one. On the NSF side, we're

        [Enrico Fermi Institute] 12:44:47
starting to see, not at the biggest Frontera-type scales,

        [Enrico Fermi Institute] 12:44:53
NSF start to give allocations as part of the grant proposal.

        [Enrico Fermi Institute] 12:45:02
So if you win a proposal like USCMS Ops,

        [Enrico Fermi Institute] 12:45:05
it comes with the allocation, as opposed to having some other, you know,

        [Enrico Fermi Institute] 12:45:10
peer-review committee. They basically view this as double jeopardy: they could give you the money and then have somebody else make you unable to use it.

        [Enrico Fermi Institute] 12:45:24
So that's beginning to go into the system, but not at the US

        [Enrico Fermi Institute] 12:45:33
LHC Ops scale. So you know, there's more than discussion;

        [Enrico Fermi Institute] 12:45:39
there are actually a couple of examples of doing this at modest scale, but not at the biggest ones.

        [Enrico Fermi Institute] 12:45:51
        Not across the finish line, but you know it's starting to show up in solicitations and things like this.

        [Enrico Fermi Institute] 12:45:57
We have provided this feedback to the funding agencies

        [Enrico Fermi Institute] 12:46:03
before, I think at a 2019 meeting, and accounting was discussed there.

        [Enrico Fermi Institute] 12:46:10
It's difficult to write a generic proposal, for a collaboration that has some broad mix of workflows, that is competitive against a specific

        [Enrico Fermi Institute] 12:46:26
scientific proposal. They get compared by scientific merit, right?

        [Enrico Fermi Institute] 12:46:31
And they're looking for specific outcomes, like: What did you discover on this machine,

        [Enrico Fermi Institute] 12:46:39
because we awarded you this allocation? And that's difficult

        [Enrico Fermi Institute] 12:46:47
if you're saying: okay, we just ran a generic mix of simulation for the experiment. So at least on the NSF side,

        [Enrico Fermi Institute] 12:47:03
this is what they are looking to tie it to: you get the allocation as part of the USATLAS or USCMS operations program. It's just not there yet at scale; it's still ramping up. I think this has

        [Enrico Fermi Institute] 12:47:27
to be addressed. I don't know if this is in this blueprint

        [Enrico Fermi Institute] 12:47:30
process; you know, we need to have some time dedicated to this, but this is an issue. I mean, we

        [Enrico Fermi Institute] 12:47:42
spend a lot of time writing proposals, and they get reviewed by a committee.

        [Enrico Fermi Institute] 12:47:47
And yeah, for the LCF proposals, basically you have to dress things up.

        [Enrico Fermi Institute] 12:47:52
We actually tried that, because we did two proposals this year. One was GPU

        [Enrico Fermi Institute] 12:47:59
reconstruction on Summit, and that was approved, because it's something new, something we haven't done before. The other one we intentionally kept plain;

        [Enrico Fermi Institute] 12:48:12
we didn't dress it up. That was general Monte Carlo production on Theta, like:

        [Enrico Fermi Institute] 12:48:17
get us some resource increase, like 10% extra, just standing up Monte Carlo production. That was rejected.

        [Enrico Fermi Institute] 12:48:25
And that's basically what I expected, because it's not exciting.

        [Enrico Fermi Institute] 12:48:32
It's something you can do everywhere. They look at it and say: Why are you on the LCF?

        [Enrico Fermi Institute] 12:48:37
You can do this somewhere else. And that's the tension with the pledged allocation, where the pledge is supposed to be able to do everything. And again, I think that's why it has to be a major outcome of the

        [Enrico Fermi Institute] 12:48:49
report: we have to make the agencies realize that the global collaborations write down their contributions as zero dollars and zero cents, because we can't get some of these resources

        [Enrico Fermi Institute] 12:49:06
in a way that we can actually plan on and pledge.

        [Enrico Fermi Institute] 12:49:14
And part of it's going to be a shift on the WLCG

        [Enrico Fermi Institute] 12:49:17
side, I think. But we also have to kind of throw some cold water on the agencies to make them

        [Enrico Fermi Institute] 12:49:25
wake up, sit up, and realize: oh, I'm not getting credit for the money. Because effectively, they're putting in money

        [Enrico Fermi Institute] 12:49:32
and getting no credit for it, and they should be

        [Enrico Fermi Institute] 12:49:39
mad about it. But this is what we've talked about for, yeah, 5, 6 years now. Let's move on to the last topic before lunch: future workflows.

      • 11:40
        Discussion 20m


        Future workflows and discussions


        [Enrico Fermi Institute] 12:49:53
Just looking forward: do we need to restrict the kinds of workflows that we run on clouds and on HPCs, and do we want to?

        [Enrico Fermi Institute] 12:50:00
Five years from now, will it make sense for us to partition our workflows?

        [Enrico Fermi Institute] 12:50:05
Will we be able to expect HPCs to just run all types of jobs? Will clouds be able to do that?

        [Enrico Fermi Institute] 12:50:11
It sounds like, for clouds, the answer is kind of yes already. But it remains to be seen

        [Enrico Fermi Institute] 12:50:16
whether the HPCs will be able to do that, and what technologies, features, or policies are needed.

        [Enrico Fermi Institute] 12:50:25
Are there any capabilities provided by HPC or cloud that would allow us to run workflows that we can't run in other places?

        [Enrico Fermi Institute] 12:50:33
We started to sketch some ideas here, but we will have some further discussion in the R&D section.

        [Enrico Fermi Institute] 12:50:40
Yeah, for cloud, it seems like we can basically run whatever we want,

        [Enrico Fermi Institute] 12:50:43
but we're limited by cost, and we can talk about that more in the cloud focus area.

        [Enrico Fermi Institute] 12:50:49
But yeah, open versus restricted: there's sort of a trade-off, because it's obviously easier

        [Enrico Fermi Institute] 12:51:00
if you can run everything. But maybe we really should consider whether, for some machines, there should be dedicated workflows, because that's what they're designed for.

        [Enrico Fermi Institute] 12:51:15
It's a balance, though, because if you restrict it too much, it's completely uninteresting for the experiment, and you will never be able to pledge it.

        [Enrico Fermi Institute] 12:51:28
I mean, if you want to pledge it, it has to be something that can run the majority of what you're doing. Otherwise, I expect,

        [Enrico Fermi Institute] 12:51:37
we discussed this over the past weeks; we got the comment:

        [Enrico Fermi Institute] 12:51:42
if it can only run a generator, that will probably still get your proposal through,

        [Enrico Fermi Institute] 12:51:47
but you're not going to get the hours credited.

        [Enrico Fermi Institute] 12:51:52
At least not easily. Maybe that's one of the outcomes from this: to push towards the actual WLCG.

        [Enrico Fermi Institute] 12:52:02
Simone is on; I hope you're hearing this: that we should get credit.

        [Enrico Fermi Institute] 12:52:08
We should work towards a situation where useful computation gets credit, no matter what it is. But the pledging comes before the useful computation; if the resource is limited, that makes it a less useful resource. I see the argument

        [Enrico Fermi Institute] 12:52:28
that when you go in and say, I have this allocation of 100 million hours at an HPC

        [Enrico Fermi Institute] 12:52:34
center, and I can run one generator there; and then there's the Tier-1 side:

        [Enrico Fermi Institute] 12:52:38
        I have 100 million hours equivalent over the whole year, or the allocation period.

        [Enrico Fermi Institute] 12:52:44
it's worth more to the experiment.

        [Enrico Fermi Institute] 12:52:47
        And I see that point

        [Enrico Fermi Institute] 12:52:50
Okay, yeah. But again, where we are right now is,

        [Enrico Fermi Institute] 12:52:55
we're saying it's worth zero, right?

        [Enrico Fermi Institute] 12:53:01
It would be worthwhile, I hope, to get 1 billion hours of MadGraph

        [Enrico Fermi Institute] 12:53:05
on Frontier or somewhere, because the hope is that it enables something for the experiment, or it offloads a lot from somewhere else which is flexible.

        [Enrico Fermi Institute] 12:53:18
So do we need to be pledging at a different quality-of-service level?

        [Enrico Fermi Institute] 12:53:25
I don't know. I think it doesn't have to be for this blueprint, but somebody actually needs to step up and provide a proposal

        [Enrico Fermi Institute] 12:53:37
that people can disagree with. Somebody at some point needs to do some writing to say: here's a model I think is useful, and be willing to take criticism, right?

        [Taylor Childers] 12:53:49
Isn't this avoiding the larger question, which is: how do you make

        [Taylor Childers] 12:54:01
the LHC workflows compatible with modern architectures? Right?

        [Taylor Childers] 12:54:07
I mean, and of course I understand all the hang-ups there.

        [Taylor Childers] 12:54:12
I'm just saying that we can talk about what architectures aren't working for the HEP

        [Taylor Childers] 12:54:22
community as long as we want, but we also need to be moving our software in a direction that makes it easier to approach different hardware. Because it's just going to get worse before it gets better: the Europeans are going in their own direction

        [Enrico Fermi Institute] 12:54:41
        Yeah.

        [Taylor Childers] 12:54:45
with hardware, the Japanese are going in their own direction with hardware,

        [Taylor Childers] 12:54:49
the US is probably going to, I assume, continue with the US manufacturers,

        [Taylor Childers] 12:54:55
for political reasons, and of course the Chinese are developing their own and plan on having tons of compute power available.

        [Taylor Childers] 12:55:04
So it's really a question of: why can't we move in that direction?

        [Taylor Childers] 12:55:11
And of course I think we all know those answers,

        [Taylor Childers] 12:55:13
but it maybe needs to travel up the chain.

        [Taylor Childers] 12:55:19
        One

        [Enrico Fermi Institute] 12:55:22
Kaushik, you had a comment?

        [Kaushik De] 12:55:26
Yeah, right. I wanted to make a few comments about this.

        [Kaushik De] 12:55:36
I mean, it's not that it isn't useful for experiments to get access to resources

        [Kaushik De] 12:55:46
that may not be globally usable but provide value for particular workflows.

        [Kaushik De] 12:55:56
        I mean, we have the tools to make use of resources like that.

        [Kaushik De] 12:56:00
Assuming we are not spending years of development and operational effort to use the resource, I think there's nothing wrong with having specialized resources, as long as they're easy to use.

        [Kaushik De] 12:56:17
        I mean the experiments know how to use them.

        [Kaushik De] 12:56:19
I think the question is: how do you assign a value to that resource?

        [Kaushik De] 12:56:28
Using the example that was given, comparing, you know, 100 million hours at an HPC

        [Kaushik De] 12:56:38
that only runs generators versus 100 million hours at a Tier-1 that can do everything for the experiment:

        [Kaushik De] 12:56:43
clearly the two things are not the same. So the question is, how do we assign different values to those two different kinds of resources?

        [Kaushik De] 12:56:50
        And I think that is the real challenge for this working group.

        [Kaushik De] 12:56:55
That's what we really need to come up with out of this workshop: how do we assign a fair value to one versus

        [Kaushik De] 12:57:03
the other.
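
One way to read this question is as a discounting problem: value an allocation's raw hours by the fraction of the experiment's workflow mix the resource can actually run. The toy sketch below, with purely invented weights, only illustrates the shape such a metric could take; it is not a proposed WLCG formula.

```python
# Toy model for valuing an allocation: discount raw hours by the share of
# the experiment's workload the resource can run. All numbers are invented.
def effective_hours(raw_hours, runnable_fraction, efficiency=1.0):
    """Value an allocation relative to a do-everything Tier-1 hour."""
    return raw_hours * runnable_fraction * efficiency

tier1 = effective_hours(100e6, runnable_fraction=1.0)         # can run anything
hpc_gen_only = effective_hours(100e6, runnable_fraction=0.2)  # generators, say, ~20% of the mix

print(f"Tier-1 equivalent hours:  {tier1:.3g}")
print(f"Generator-only HPC hours: {hpc_gen_only:.3g}")
```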

        [Enrico Fermi Institute] 12:57:09
That's maybe a good note on which to break for lunch. We have the HPC focus area, where we have more time to go into some of these things in more detail, and we'll have more slides prepared to cover some of that. One thing I just want to mention before we close:

        [Enrico Fermi Institute] 12:57:25
framework developments are specifically supposed to be outside the scope.

        [Enrico Fermi Institute] 12:57:31
We'll touch on it a little bit, because it sets some of the scope of what's usable and what's not,

        [Enrico Fermi Institute] 12:57:36
but we don't want to go into that. Yeah, we have to partition somewhere.

        [Enrico Fermi Institute] 12:57:39
Yeah, please, let's not go design that ourselves.

        [Enrico Fermi Institute] 12:57:52
We heard that from the HPC people too. Okay, so we'll break for an hour.

        [Enrico Fermi Institute] 12:58:04
We'll be back at one o'clock US Central time, and we'll do the HPC

        [Enrico Fermi Institute] 12:58:10
focus area.

        [Enrico Fermi Institute] 12:58:12
See everybody then.

        [Andrew Melo] 13:00:38
Everybody, I had to pop out for a second. Are we scheduled to restart at 1 PM?

        [John Steven De Stefano Jr] 13:00:48
Scheduled to reconvene in one hour, at 2 PM

        [John Steven De Stefano Jr] 13:00:53
here in Eastern.

        [Andrew Melo] 13:00:56
        Gotcha. Okay, So we're on schedule.

    • 12:00 13:00
      Lunch Break 1h
    • 13:00 19:05
      First Day Afternoon: HPC Focus Area

       

      AFTERNOON SESSION

      (Eastern Time)

       

       

      [Enrico Fermi Institute] 14:00:40
We're just getting back into the room here and getting started again.

      [Enrico Fermi Institute] 14:00:44
      So now we're starting the Hpc focus area.

      [Enrico Fermi Institute] 14:00:49
block. Yeah, thanks. So we can jump right into it here.

      [Enrico Fermi Institute] 14:00:55
Okay, I see the people are rejoining. So this afternoon we have the HPC focus area.

      [Enrico Fermi Institute] 14:01:04
We already did quite a bit of discussion, but the hope is that we go a little deeper on certain topics, and we also have some questions and points for discussion that weren't brought up yet. So this is just a redo, maybe a little bit deeper than the

      [Enrico Fermi Institute] 14:01:22
introduction slide, on what we're basically targeting, and the separation of the user-focused facilities and LCFs. Maybe one thing here on the user-focused facilities that hasn't been discussed a lot is where this is going for the NSF-funded HPC: whether they stay

      [Enrico Fermi Institute] 14:01:50
on CPU only, or whether they will also follow the transition to GPU, because so far they pretty much follow their users.

      [Enrico Fermi Institute] 14:02:02
They have a few GPUs on the side for training and testing, but it's usually not the bulk of the facility. NERSC has made that switch with the transition from Cori to Perlmutter.

      [Enrico Fermi Institute] 14:02:15
So do we have to worry about the same switch happening at the NSF facilities at some point? Do they have the same power constraints? Probably not, because they're smaller facilities, but they're also getting larger.

      [Enrico Fermi Institute] 14:02:37
Right, do you have any input on that question? Which question? What about the next generation of NSF-funded HPC: do we have to worry about them making the transition

      [Enrico Fermi Institute] 14:02:46
to GPU, or will they stay on CPU and follow their users? There's always going to be a big hunking CPU machine,

      [Enrico Fermi Institute] 14:02:56
so I don't think Anvil or Expanse or machines of that sort are going away.

      [Enrico Fermi Institute] 14:03:06
But past that, you know, it comes down to the question: do you believe what NSF

      [Enrico Fermi Institute] 14:03:14
has been authorized by Congress, or do you believe what they've been appropriated by Congress?

      [Enrico Fermi Institute] 14:03:19
Some of the big expansions, you know, would allow a leadership-class facility on the NSF side, and that would be for a lot of the same reasons as on the DOE

      [Enrico Fermi Institute] 14:03:39
side. So if you believe that's done and it's going to happen, then yeah, there's going to be a big honking, heavy-GPU machine.

      [Enrico Fermi Institute] 14:03:49
But I don't think that's going to be

      [Enrico Fermi Institute] 14:03:54
in addition to the other types of resources they always have. I mean, the big machine that they have right now is Frontera.

      [Enrico Fermi Institute] 14:04:02
That's all CPU. It's very, very

      [Steven Timm] 14:04:04
If you look at the TACC website, there is also something about a leadership-class facility machine coming. I don't

      [Enrico Fermi Institute] 14:04:06
It's not leadership-class.

      [Steven Timm] 14:04:19
think they say when it's coming, but they say it's coming.

      [Enrico Fermi Institute] 14:04:22
Yeah, so they've gotten authorization to do the design studies.

      [Enrico Fermi Institute] 14:04:26
And they're doing all the kind of groundwork to do such a thing. But at some point somebody has to come up with a slug of money, and I think, if you go by what Congress has authorized, the NSF has a sufficient slug of money, because their total budget goes up

      [Enrico Fermi Institute] 14:04:45
by 20%. But Congress, at least in 2022, has not actually appropriated the money.

      [Enrico Fermi Institute] 14:04:54
So that's where it gets into crystal-ball territory; you can lose your whole afternoon trying to guess what funding agencies are going to do.

      [Enrico Fermi Institute] 14:05:01
So I wouldn't suggest doing that. But again, the short version is: I personally believe there are always going to be some sort of heavy CPU resources, because they are wildly popular within NSF. There are going to be GPU resources

      [Enrico Fermi Institute] 14:05:18
as well, like the GPUs that, I guess, you have on

      [Enrico Fermi Institute] 14:05:21
Bridges-2, but it's going to be very balanced, based on the user

      [Enrico Fermi Institute] 14:05:26
community. Yeah, the thing that might change or grow is whether or not you believe this TACC leadership facility is coming.

      [Steven Timm] 14:05:34
      good.

      [Enrico Fermi Institute] 14:05:35
Okay. So one question, though.

      [Ian Fisk] 14:05:36
      oh!

      [Ian Fisk] 14:05:41
I wanted to mention a couple of things. Expanse is not that big; Expanse is 90,000 cores, which makes it like a tenth of the WLCG. It's far from a leadership-class machine.

      [Enrico Fermi Institute] 14:05:50
      Yeah.

      [Steven Timm] 14:05:51
      Indeed

      [Ian Fisk] 14:06:04
And I think, if you look at where NSF

      [Ian Fisk] 14:06:08
has spent their money, they've also spent it on really exploratory things, like Voyager, which is an AI

      [Steven Timm] 14:06:13
      Yeah.

      [Enrico Fermi Institute] 14:06:14
machine. They have an ARM testbed at Stony Brook right now.

      [Ian Fisk] 14:06:15
      Yeah, And yeah, yeah, they have the So Japanese name.

      [Enrico Fermi Institute] 14:06:20
Ookami, I think.

      [Ian Fisk] 14:06:21
Yeah. And so they've also spent some money on exploratory things,

      [Ian Fisk] 14:06:27
and my guess is that Brian's right, in the sense that NSF is a little bit more attuned to what people are using. But you could imagine that could change as people figure out how to use alternative machines; if the GPUs, in addition to having a lot more processing

      [Steven Timm] 14:06:29
      Yeah.

      [Ian Fisk] 14:06:45
power, have a lot more processing power per watt, and that becomes important to people, then there will be pressures there, too.

      [Enrico Fermi Institute] 14:06:48
      Yeah.

      [Enrico Fermi Institute] 14:06:54
Yeah. I guess the point I was making is: NSF

      [Enrico Fermi Institute] 14:06:59
is very attuned to the user base. If five years from now the user base is screaming for GPUs because machine learning has eaten the world,

      [Ian Fisk] 14:07:09
      right.

      [Enrico Fermi Institute] 14:07:10
then you're going to see a much stronger push. And even if that doesn't happen, I don't get the impression that there's a lot of growth opportunity even at NSF-funded CPU HPC. Yeah, there's a little bit of organic growth;

      [Enrico Fermi Institute] 14:07:27
I mean, Bridges-2 is faster than Bridges, and Expanse is a bit faster than Comet.

      [Steven Timm] 14:07:27
      Great

      [Enrico Fermi Institute] 14:07:32
But it's not an order of magnitude;

      [Enrico Fermi Institute] 14:07:33
they don't double or triple the capacity from one step to the next.

      [Steven Timm] 14:07:36
      Great

      [Steven Timm] 14:07:40
This is a question; I'm not sure if you're going to come to it later.

      [Steven Timm] 14:07:44
Maybe it's too early to ask, but: you see ever more CPU that you need, and the existing leadership-class facilities are not going to grow that much

      [Steven Timm] 14:08:00
during that time; your allocation on them is not going to grow that much by then. And the national labs,

      [Steven Timm] 14:08:07
they're not buying more, because strategically they're saying: we're going to the leadership-class facilities.

      [Steven Timm] 14:08:16
But there's going to be a gap: between 50 and 70% of the resources you need are not going to be there.

      [Steven Timm] 14:08:26
The projections vary, but HPC is not going to solve the whole problem;

      [Steven Timm] 14:08:30
there are not enough of them. Do you agree at all?

      [Enrico Fermi Institute] 14:08:34
Hmm. It helps if you can use the GPUs, and that gets to the second point: we have the LCFs.

      [Steven Timm] 14:08:43
      Yeah, yeah.

      [Enrico Fermi Institute] 14:08:44
Here I'm going a little bit into the LCF

      [Enrico Fermi Institute] 14:08:46
      Landscape, and then we discussed a lot of that already in the morning session.

      [Enrico Fermi Institute] 14:08:50
but one thing is the trend towards accelerators:

      [Enrico Fermi Institute] 14:08:56
if you look at what's there in terms of CPU, it's usually still significant.

      [Enrico Fermi Institute] 14:09:01
Most of the compute is on the GPU side, which we can't really use effectively right now, but there's a lot of CPU there. And what in my mind is an open question is: what's the threshold for being able to use these machines? What's

      [Enrico Fermi Institute] 14:09:19
good enough in terms of GPU utilization?

      [Enrico Fermi Institute] 14:09:24
I don't know the answer to that. I know that very early on, when that move started to happen, there were statements I heard from people that were in meetings with the agency, saying: oh, you have to have full-on GPU utilization or you're not going to

      [Enrico Fermi Institute] 14:09:42
be allowed on the machine. That's softened significantly over time,

      [Enrico Fermi Institute] 14:09:46
but still, there are two

      [Enrico Fermi Institute] 14:09:50
sides. One is: what do we need to do to get a proposal through?

      [Taylor Childers] 14:09:56
      sure, sure.

      [Enrico Fermi Institute] 14:09:57
And how much do we need to use the GPU

      [Enrico Fermi Institute] 14:10:00
so we don't feel ashamed of running on these resources ourselves?

      [Enrico Fermi Institute] 14:10:05
There's a certain point where it's just ridiculous, even if they would allow us to run, right?

      [Enrico Fermi Institute] 14:10:10
So we have a comment coming from Paolo.

      [Paolo Calafiura (he)] 14:10:12
It's a comment, really.

      [Paolo Calafiura (he)] 14:10:17
I keep hearing the problem framed in this way, not only here, but in ATLAS a lot, even more than here.

      [Paolo Calafiura (he)] 14:10:26
It's framed like: oh darn, the HPC community is making this move to GPU, and

      [Paolo Calafiura (he)] 14:10:32
are they losing all of their users? I don't have precise data, but my understanding, anecdotally, is that today, if you want to run on a GPU node on Perlmutter, you have to wait hours. So we are the laggards here, okay; the new

      [Enrico Fermi Institute] 14:10:47
      Yes.

      [Enrico Fermi Institute] 14:10:52
      Yeah.

      [Paolo Calafiura (he)] 14:10:54
communities have no problem whatsoever in using accelerators.

      [Paolo Calafiura (he)] 14:10:59
So we have a choice. Either we become like banks, we keep running our IBM

      [Paolo Calafiura (he)] 14:11:05
370s and COBOL, and we are fine, you know, we have the money to do it, and we accept the physics limitations that come with it.

      [Paolo Calafiura (he)] 14:11:16
Or we make the jump. I think, you know, framing the problem like: yeah, maybe NERSC is going to give,

      [Paolo Calafiura (he)] 14:11:23
I mean, NERSC is going to give us what we have now, presumably for the lifetime of Perlmutter.

      [Paolo Calafiura (he)] 14:11:29
That's about 1% of the simulation.

      [Paolo Calafiura (he)] 14:11:33
I know the ATLAS numbers; I don't know the others.

      [Paolo Calafiura (he)] 14:11:36
I mean, it's nice to have it,

      [Paolo Calafiura (he)] 14:11:40
but is it worth having a workshop about 1%, you know, or 2?

      [Paolo Calafiura (he)] 14:11:45
I think we either make the jump, or

      [Paolo Calafiura (he)] 14:11:53
we just step out and say: look, we will use our legacy CPUs, and then perhaps for Run 5, when I'm retired, or worse, we will use whatever architecture is around then. So I think we're framing

      [Enrico Fermi Institute] 14:12:06
      But

      [Paolo Calafiura (he)] 14:12:11
the problem in a slightly wrong way. And I know that there are other slides discussing accelerators and whatnot.

      [Paolo Calafiura (he)] 14:12:21
      But yeah.

      [Enrico Fermi Institute] 14:12:23
But, Paolo, the jump is not going to be a jump to the top in one

      [Enrico Fermi Institute] 14:12:27
go. We're going to jump up one step, and then we

      [Enrico Fermi Institute] 14:12:30
can jump up the next step, and so on. And to get to that first step,

      [Enrico Fermi Institute] 14:12:36
that's basically my question. Because

      [Ian Fisk] 14:12:37
Right. But I think what Dirk would probably say, which I agree with, is that at some point we have to commit that this is a step we're going to make, that we're going to succeed at this. And we can define what success

      [Ian Fisk] 14:12:51
looks like, but we sort of have to say: we're going to do this. And I think you have to say that, because, to first order, all of the processing is in these machines. The other thing is, I think we're actually not as far away as we think. Like,

      [Enrico Fermi Institute] 14:12:54
      Yeah, I mean.

      [Ian Fisk] 14:13:06
not ATLAS, but CMS at least

      [Ian Fisk] 14:13:10
and LHCb are all using GPUs in the online right now, running software

      [Ian Fisk] 14:13:13
they wrote. We're not that far away, and I think you can define whatever sort of metric you want,

      [Enrico Fermi Institute] 14:13:14
      Okay.

      [Ian Fisk] 14:13:20
but my guess is that a few algorithms that show the thing is faster with the GPUs than without are enough to sort of get you in the door.

      [Enrico Fermi Institute] 14:13:28
Yeah, that was my question.

      [Enrico Fermi Institute] 14:13:30
And I agree with the answer. I just wanted to phrase it as a question, because I know there are disagreements about that, and there are also statements from the people that fund these machines, from years ago, that were different.

      [Ian Fisk] 14:13:40
Alright. And one of the things we have to be a little bit careful of is that you can be a victim of your own success here. Like, if you take advantage of the accelerated resource

      [Ian Fisk] 14:13:51
and the throughput for reconstruction of the tracker in CMS goes up by a factor of 10:

      [Ian Fisk] 14:13:56
we do not have an I/O system that's designed to handle 10 times the data going in.

      [Enrico Fermi Institute] 14:14:05
      There's a comment from Eric

      [Eric Lancon] 14:14:09
Yes, I wanted to go back to what Paolo and Ian said.

      [Eric Lancon] 14:14:17
I believe there are two topics which are mixed here:

      [Eric Lancon] 14:14:21
accelerators and HPCs.

      [Eric Lancon] 14:14:27
As mentioned by Ian, the code will be ready by most of the experiments, by necessity, for using accelerators.

      [Eric Lancon] 14:14:40
So nothing prevents classical sites from offering accelerators as resources for the experiment.

      [Eric Lancon] 14:14:51
Now, the use of the big HPCs is supposed to

      [Eric Lancon] 14:15:01
address the lack of CPUs, moving forward, for the experiments.

      [Enrico Fermi Institute] 14:15:12
      Okay.

      [Eric Lancon] 14:15:16
Is the missing factor as big as we believe? That's what we have to understand.

      [Eric Lancon] 14:15:23
Because do we need to use HPC or not? That's the real question: to complement the classical resources beyond the standard operation. It's not so clear

      [Eric Lancon] 14:15:34
that we really need the big HPCs

      [Eric Lancon] 14:15:43
for complementing the effort of the sites.

      [Eric Lancon] 14:15:44
Is it true or not? Maybe it's only a factor of 50% above the needs.

      [Enrico Fermi Institute] 14:15:56
      Okay.

      [Paolo Calafiura (he)] 14:16:00
I can comment on the needs, having been involved in the calculations. One of the things we have to keep in mind is that the needs sort of naturally tune to the resources available.

      [Paolo Calafiura (he)] 14:16:20
There is no point in claiming your needs are 100 times bigger than the resources available to you;

      [Paolo Calafiura (he)] 14:16:26
so you make choices which make those needs go down.

      [Paolo Calafiura (he)] 14:16:32
And what I'm very nervous about is that, as we try to achieve a reasonable computing model, we are potentially giving up things that we could do, especially in a world of precision physics, which is the

      [Paolo Calafiura (he)] 14:16:55
one we are moving towards with Run 3 and Run 4.

      [Paolo Calafiura (he)] 14:16:58
I don't know about Run 5. So I'm a little bit nervous about saying: we don't really need it.

      [Paolo Calafiura (he)] 14:17:08
Because we're making physics choices which are allowing us not to need it, and whether those choices are wise or not, I'm probably not competent to judge. But that's what I'd say.

      [Enrico Fermi Institute] 14:17:28
Ian was next, yeah.

      [Ian Fisk] 14:17:29
Yeah, it was just a comment about the scale, which is to say: when we started planning for the HL-LHC, in ATLAS we had needs sort of factors of 6 or 10 more than we could expect.

      [Ian Fisk] 14:17:45
We saw that it was really terrible, and then we've made some improvements.

      [Ian Fisk] 14:17:49
      So we fix it, and now it's down. But like the difference between failing completely and sort of making some really painful choices I think we're now at the level of like if, the Hbc's got us 25% and that allowed us to make a lot fewer really painful

      [Enrico Fermi Institute] 14:17:57
      You.

      [Ian Fisk] 14:18:04
choices. I understand 25% is not a factor of 4 or 5, like

      [Enrico Fermi Institute] 14:18:06
      Okay.

      [Ian Fisk] 14:18:10
it was a few years back. But it seems like there was a time,

      [Ian Fisk] 14:18:14
certainly, when if someone told you that you had 20% more computing resources, you would have been thrilled.

      [Ian Fisk] 14:18:24
And it just seems like these resources are on the table.

      [Ian Fisk] 14:18:28
They are; we built them; they're there. It seems like it would be

      [Ian Fisk] 14:18:34
a really strange choice not to at least try to use them.

      [Eric Lancon] 14:18:40
No, no, I agree. But the first thing is to get the software ready.

      [Enrico Fermi Institute] 14:18:50
Yeah, maybe that's a good way to lead over to the next slide, which is looking at how we're actually using these facilities, like some of the integrations. Next slide.

      [Enrico Fermi Institute] 14:19:02
So where are we actually running today, actively? ATLAS, do you want to say something about that? So, we've been using Cori and Perlmutter for multiple years;

      [Enrico Fermi Institute] 14:19:15
we had a proposal for using TACC Frontera.

      [Enrico Fermi Institute] 14:19:21
In the past we used OLCF as well.

      [Enrico Fermi Institute] 14:19:25
But those are sort of dormant now. Most of the focus is on NERSC, Cori and Perlmutter,

      [Enrico Fermi Institute] 14:19:32
and TACC. Yeah. CMS: similarly, we focused on the user facilities, because as low-hanging fruit it was easier.

      [Enrico Fermi Institute] 14:19:42
We've been on Cori and Perlmutter for multiple years; we have an XSEDE, well now, I guess, ACCESS allocation, though that transition hasn't happened yet.

      [Enrico Fermi Institute] 14:19:50
So for the next one we'll have to deal with ACCESS. We had been running on whatever was available;

      [Enrico Fermi Institute] 14:19:58
currently that set is Expanse, Anvil, and Stampede2. In the past

      [Enrico Fermi Institute] 14:20:04
it was Bridges and Comet. And on Frontera we've been running for multiple years. Then for the LCFs, we had allocations in the past, and one currently active.

      [Enrico Fermi Institute] 14:20:16
In the past we had the Theta allocation that was joint with ATLAS;

      [Enrico Fermi Institute] 14:20:20
we used it to do some generators. And now we're actually trying something a little bit more serious, which is on Summit: to contribute Summit resources

      [Enrico Fermi Institute] 14:20:35
to the end-of-year 2022 CMS

      [Enrico Fermi Institute] 14:20:40
data re-reconstruction. The physics validation of Power was just completed, not on Summit itself, but on Marconi 100, which is basically exactly the same system

      [Enrico Fermi Institute] 14:20:51
architecture as Summit. But that was CPU-only validation,

      [Enrico Fermi Institute] 14:20:56
so hopefully GPU will be the next step. Basically, that's what we want to do with Summit.

      [Enrico Fermi Institute] 14:21:03
Yeah. We also have some slides on the European efforts as well;

      [Enrico Fermi Institute] 14:21:09
just wanted to show them as an example, because they sometimes follow different approaches in terms of integration.

      [Enrico Fermi Institute] 14:21:16
So, you're using GPUs in the end-of-2022 data re-reco? Really, that's the plan we want to pursue.

      [Enrico Fermi Institute] 14:21:22
We have 50,000 hours on Perlmutter from the allocation that we got, and we have 50,000 hours on Summit, which is not much.

      [Enrico Fermi Institute] 14:21:31
It's not going to contribute a lot, but we just want to show proof of principle.

      [Enrico Fermi Institute] 14:21:36
And then, if it works, we would ask for more hours in the next allocation cycle to do this again,

      [Andrew Melo] 14:21:41
Sure, sorry, what was the second half of Rob's question? I heard "do you want to use GPUs", and then I kind of lost it.

      [Enrico Fermi Institute] 14:21:41
      But with the larger

      [Enrico Fermi Institute] 14:21:51
I was asking if, in the plans for the end-of-2022 data re-reco, you're going to use GPUs.

      [Enrico Fermi Institute] 14:22:03
Yes. I mean, the problem at the moment is more putting together a workflow and trying to figure out which GPU algorithms are ready to put in. It might just be that we're going to run something in parallel to the normal reconstruction, and then use that as a

      [Enrico Fermi Institute] 14:22:23
validation; maybe run some validation samples. I would be happy with that as well.

      [Enrico Fermi Institute] 14:22:27
It's not directly the primary reconstruction, but more like a parallel workflow that they can compare.

      [Andrew Melo] 14:22:35
On that: we actually do have an offline re-reconstruction workflow that's very close to being validated.

      [Enrico Fermi Institute] 14:22:39
      Okay.

      [Enrico Fermi Institute] 14:22:44
I know, I know.

      [Andrew Melo] 14:22:45
Yeah, it's just a matter of: there are some issues with the CPU

      [Andrew Melo] 14:22:52
side of the memory, you know, taking more than it needs. But I think by the end of the year, for sure, we're going to at least be doing some fraction of the reconstruction with GPUs.

      [Enrico Fermi Institute] 14:23:01
Yeah, I hope that will happen, and then we can

      [Enrico Fermi Institute] 14:23:07
Great. Yeah, as far as integration goes, specific technologies: for ATLAS, we're using Harvester, which runs at the edge.

      [Enrico Fermi Institute] 14:23:19
At all of our HPC facilities, we run a Harvester process that essentially lives on the HPC

      [Enrico Fermi Institute] 14:23:24
login nodes. Harvester directly pulls jobs down from PanDA, transforms them, and packs them appropriately, so that they can be sent to the local HPC

      [Enrico Fermi Institute] 14:23:36
batch system. It also handles the data transfer, so it facilitates staging

      [Enrico Fermi Institute] 14:23:40
data in and out of the Rucio data federation, essentially by way of a third-party service that lives at BNL.

      [Enrico Fermi Institute] 14:23:50
Yeah. And so this approach works on all the sites, including LCFs, because pilots don't necessarily have to talk to the wide-area network. Everything is local, and Harvester facilitates all the communication with PanDA through the shared file system.
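
As a concrete picture of the edge-service model just described, here is a minimal sketch of a Harvester-style pull loop. All function names are hypothetical stubs, not the real Harvester API; the point is only that the login node pulls work, submits it locally, and moves data over the shared file system, so workers need no wide-area connectivity.

```python
# Illustrative sketch of a Harvester-style edge service (hypothetical stubs,
# not the real Harvester code). It runs on an HPC login node.
import time

def pull_jobs(limit):
    # Hypothetical stand-in for a PanDA/Harvester "get jobs" call.
    return []  # each item would describe one payload to run

def stage_in(job, shared_dir):
    # Copy input files onto the shared file system (stub).
    pass

def submit_to_batch(job, shared_dir):
    # Write a batch script and hand it to the local scheduler, e.g. Slurm (stub).
    pass

def stage_out_finished(shared_dir):
    # Push finished outputs back out and report status upstream (stub).
    pass

SHARED = "/scratch/harvester"  # illustrative shared-FS path visible to workers

while True:
    for job in pull_jobs(limit=10):   # pull work down to the edge
        stage_in(job, SHARED)         # inputs land on the shared FS
        submit_to_batch(job, SHARED)  # local batch submission; no WAN on workers
    stage_out_finished(SHARED)        # outputs go back out via the edge node
    time.sleep(60)
```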

      [Enrico Fermi Institute] 14:24:12
For CMS, then, we do things a little bit differently, which has advantages and disadvantages.

      [Enrico Fermi Institute] 14:24:18
The advantage is mostly in the HPC integration at the user facilities, because it really makes them look like a grid site.

      [Enrico Fermi Institute] 14:24:30
It's basically the same approach we used for opportunistic use, when we tried to run on the LIGO site.

      [Enrico Fermi Institute] 14:24:36
Basically, the software is available via CVMFS or cvmfsexec

      [Enrico Fermi Institute] 14:24:42
that we run ourselves. We use container solutions,

      [Enrico Fermi Institute] 14:24:46
independent local squids, and no managed storage at these facilities. So we treat it as an extension, basically an add-on, to Fermilab storage: it uses

      [Enrico Fermi Institute] 14:24:58
Fermilab storage, or via AAA the whole CMS storage, but mostly Fermilab, for reading input data, streaming input data.

      [Enrico Fermi Institute] 14:25:06
And then it stages out directly to Fermilab, so we don't have to worry about the local site storage or data transfers.

      [Enrico Fermi Institute] 14:25:11
Everything is contained within the job, and the provisioning integration follows the OSG model.

      [Enrico Fermi Institute] 14:25:21
So we submit pilots through HTCondor Bosco,

      [Enrico Fermi Institute] 14:25:23
remote SSH. That's either, in the case of NERSC, directly connected to HEPCloud, or, for XSEDE and TACC resources,

      [Enrico Fermi Institute] 14:25:31
we go through OSG-managed HTCondor instances. And we might eventually also do the same for the LCFs.
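
For the provisioning side just described, a minimal sketch of Bosco-style pilot submission using the HTCondor Python bindings might look like the following. The hostname and file names are placeholders, and the remote cluster would first have to be registered (e.g. with condor_remote_cluster); treat this as an illustration of the grid-universe "batch over SSH" pattern, not the production glidein configuration.

```python
# Sketch of Bosco-style pilot submission with the HTCondor Python bindings.
# Hostname and file names are made up; the remote cluster must already be
# registered for SSH submission.
import htcondor

pilot = htcondor.Submit({
    "universe":      "grid",
    # "batch slurm" tells HTCondor to hand the job to Slurm on the far side,
    # reached over SSH; no grid middleware is needed on the HPC itself.
    "grid_resource": "batch slurm pilot@login.hpc.example.edu",
    "executable":    "pilot_wrapper.sh",   # starts the glidein/pilot payload
    "output":        "pilot.out",
    "error":         "pilot.err",
    "log":           "pilot.log",
})

schedd = htcondor.Schedd()
result = schedd.submit(pilot)  # queue one pilot
print("Submitted cluster", result.cluster())
```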

      [Enrico Fermi Institute] 14:25:40
Stage-in, or streaming? And do you know, have you measured staging and streaming to compare? We know it for NERSC, because at NERSC the storage is now fully integrated, but at the beginning

      [Enrico Fermi Institute] 14:25:56
it wasn't fully integrated, and we just copied in, more or less manually, the most often used pileup library.

      [Enrico Fermi Institute] 14:26:04
      They give us some space for that, and I actually have a comparison.

      [Enrico Fermi Institute] 14:26:07
It makes very little difference for job failure rates; CPU efficiency is about 5 to 10% different.

      [Enrico Fermi Institute] 14:26:14
So it's a small efficiency degradation.

      [Enrico Fermi Institute] 14:26:19
      It's a noticeable effect, but it's not a huge effect exactly.

      [Enrico Fermi Institute] 14:26:22
You don't see a 50% drop, for example.

      [Enrico Fermi Institute] 14:26:28
The upside of this is that it's simple:

      [Enrico Fermi Institute] 14:26:34
we don't have anything running permanently at the HPC side;

      [Enrico Fermi Institute] 14:26:38
it basically completely follows the grid integration model.

      [Enrico Fermi Institute] 14:26:43
The downside is that the LCFs are really not compatible with this approach: because you don't have outbound Internet, you can't follow this approach completely. The runtime kind of works the same way, because cvmfsexec and Singularity

      [Enrico Fermi Institute] 14:26:58
are both there, so that part works, as long as you can somehow put a squid server on the edge.

      [Enrico Fermi Institute] 14:27:03
You can do things. The provisioning layer is the larger issue,

      [Enrico Fermi Institute] 14:27:11
and we only have prototypes so far, nothing

      [Enrico Fermi Institute] 14:27:13
we would call production-ready. And AAA reads:

      [Enrico Fermi Institute] 14:27:17
so far that's also not usable, so we can't stream to LCF

      [Enrico Fermi Institute] 14:27:21
batch nodes. There are two possible solutions here: an XRootD proxy is possible in principle, but we've only ever talked about it;

      [Enrico Fermi Institute] 14:27:30
      I don't think anyone has ever set one up at an Lcf.

      [Enrico Fermi Institute] 14:27:33
And it's probably too much network traffic to route through a single edge

      [Enrico Fermi Institute] 14:27:39
node, no matter how well connected that machine is, at least at

      [Enrico Fermi Institute] 14:27:43
      The scales we're talking about here to make click.

      [Enrico Fermi Institute] 14:27:47
The other is that you actively manage the storage.

      [Enrico Fermi Institute] 14:27:50
So you do your Rucio integration, and then you just let the regular CMS

      [Enrico Fermi Institute] 14:27:57
data management and workload management stacks work with that location and pre-stage data. And again, at the LCF-type scale,

      [Enrico Fermi Institute] 14:28:04
I think you need to actively manage storage.

      [abh] 14:28:06
Right, could I pipe in here just for a second?

      [Enrico Fermi Institute] 14:28:09
      Yeah.

      [abh] 14:28:11
People have used proxies at NERSC. Mind you, the setup there is a little bit easier, because they have multiple DTNs, and you can actually use all of the DTNs for the proxy server.

      [abh] 14:28:23
So it is possible, but you need a rather fluid setup like NERSC's.

      [Enrico Fermi Institute] 14:28:23
      Huh!

      [Enrico Fermi Institute] 14:28:32
Yeah. As I said, at NERSC it wasn't needed;

      [Enrico Fermi Institute] 14:28:34
I mean, I think the worker-node connectivity is good enough that we don't really need it at the moment.

      [Enrico Fermi Institute] 14:28:41
It's not worth the effort yet.

      [abh] 14:28:42
      Okay.

      [Enrico Fermi Institute] 14:28:45
And Perlmutter should be even better. Maybe we haven't really scale-tested Perlmutter at that level yet,

      [Enrico Fermi Institute] 14:28:52
but from what I saw of how the design has evolved, and what that gives us in terms of network integration,

      [Enrico Fermi Institute] 14:28:58
and from what was said here as well, I expect it to work better going forward.

      [Enrico Fermi Institute] 14:29:04
So you see, the CMS plan is just to not even worry about local storage. And Fermilab doesn't have a Globus Online license,

      [Enrico Fermi Institute] 14:29:21
so our plan is that we do multi-hop transfers through NERSC, because NERSC at the moment still has GridFTP, and we're working with them to get XRootD transfers going. Once that is in place, our plan is to manage the LCF

      [Enrico Fermi Institute] 14:29:35
data transfers through NERSC. So everything goes multi-hop through NERSC, and we will need a bit of space there.

      [Enrico Fermi Institute] 14:29:42
And once that is in place, we might start exploring also running actively managed storage there.

      [Enrico Fermi Institute] 14:29:49
But it will probably still have a large streaming component. A dumb question, so we can stop going down the rabbit hole:

      [Enrico Fermi Institute] 14:29:54
I assume several of the Tier-2s have Globus licenses.

      [Enrico Fermi Institute] 14:30:01
We could route it through those, too.

      [Enrico Fermi Institute] 14:30:05
Or through different ones.

      [Paolo Calafiura (he)] 14:30:11
And just to be sure I understand: by provisioning integration you mean assigning the work to workers, since they cannot reach out on their own?

      [Enrico Fermi Institute] 14:30:19
It's basically the system: you have work in the system

      [Enrico Fermi Institute] 14:30:26
that is assigned to an HPC; now bring up resources to run that work and route

      [Paolo Calafiura (he)] 14:30:32
      Yeah, yeah, yeah, understood. Yeah.

      [Enrico Fermi Institute] 14:30:33
the work there.

      [Enrico Fermi Institute] 14:30:41
So now we have a slide on strategic considerations and the security model.

      [Enrico Fermi Institute] 14:30:48
We probably don't need to spend too much time on the security part, because there's a discussion on Wednesday where we'll hopefully have some security folks from Fermilab.

      [Enrico Fermi Institute] 14:30:59
We invited someone, maybe from WLCG as well. But we wanted to discuss some of the strategic questions about HPC use, and we already covered some of them:

      [Enrico Fermi Institute] 14:31:12
the yearly allocation cycle doesn't fit with our resource planning, so we cannot plan with resources that we're not sure we will have.

      [Enrico Fermi Institute] 14:31:20
So far we focused mostly on the fact that, since they don't fit our resource-planning cycle, we can't pledge them.

      [Enrico Fermi Institute] 14:31:27
We don't get any credit for them, which is eventually a problem mostly for the funding agencies.

      [Enrico Fermi Institute] 14:31:31
But there's another issue. If we are moving into a resource-constrained environment for the HL-LHC, it also means that resources that are not pledged and that we cannot plan with cannot be included as part of our plan, which means our plan has to be artificially downsized to not consider them,

      [Enrico Fermi Institute] 14:31:49
which is not much of a restriction on us at the moment,

      [Enrico Fermi Institute] 14:31:52
because we have enough resources to cover everything we need to do.

      [Enrico Fermi Institute] 14:31:58
But that might not be the case anymore in the HL-LHC environment.

      [Enrico Fermi Institute] 14:32:09
I see Eric's hand is up.

      [Eric Lancon] 14:32:12
Yes, I'd like to intervene, because it's not the first time we say that we cannot pledge.

      [Eric Lancon] 14:32:20
      I think it's a bit too strong a statement.

      [Eric Lancon] 14:32:26
It might be better to say that the experiments, or the WLCG,

      [Eric Lancon] 14:32:34
need to evolve towards a model of dedicated campaigns.

      [Eric Lancon] 14:32:42
Because currently we would like to use those HPCs

      [Eric Lancon] 14:32:49
as regular WLCG sites, and they're not very well suited for this.

      [Eric Lancon] 14:32:57
You may want to consider that the experiments run Monte Carlo campaigns a few times a year, and these campaigns of short duration are exported to those HPCs,

      [Eric Lancon] 14:33:12
which have a large capacity. In that case you could consider pledging these resources, because you don't have a flat CPU requirement across the year from the experiment. You see what I mean?

      [Enrico Fermi Institute] 14:33:30
So you want to pledge it for specific purposes.

      [Enrico Fermi Institute] 14:33:35
You want to say that this campaign is a pledged campaign on this resource. That connects to what

      [Enrico Fermi Institute] 14:33:42
I think we had this morning, where we said we want to

      [Enrico Fermi Institute] 14:33:46
move away from the universally usable resource pledge,

      [Enrico Fermi Institute] 14:33:51
that is, one you could target anything at, towards pledging for a specific purpose.

      [Eric Lancon] 14:33:58
Yes. Because why is it that the Monte Carlo is quite flat across the year, to first order?

      [Eric Lancon] 14:34:05
It's because there's not enough

      [Eric Lancon] 14:34:08
CPU capacity to absorb the Monte Carlo simulation within one month.

      [Eric Lancon] 14:34:16
One month is just an example. So the operational model should adapt to the type of resources that the experiments want to use.

      [Eric Lancon] 14:34:28
      Maybe

      [Enrico Fermi Institute] 14:34:32
Okay. I think Andrew is next.

      [Andrew Melo] 14:34:38
Yeah. So I did want to point out first off that there is a meeting:

      [Andrew Melo] 14:34:43
the WLCG meeting planned for November,

      [Andrew Melo] 14:34:47
where the plan is, I guess, at least as it was put to me, to reopen the MoU

      [Andrew Melo] 14:34:54
and to discuss things like this. So I don't think that's going to be stuck there forever.

      [Andrew Melo] 14:35:01
And then I think also, you know, the new HEPScore benchmark is quickly converging, so that we actually have a unit with which we can make a resource request

      [Andrew Melo] 14:35:18
and also pledges. I do want to push back a little bit and say that we probably don't want the pledging infrastructure to be so fine-grained as to say that we request X amount of whatever for a certain amount of time on

      [Andrew Melo] 14:35:36
      the resources. But I do think that the ability to

      [Andrew Melo] 14:35:45
put these facilities into the pledge in a holistic way is something that's hopefully coming with the cycle of how everything works.

      [Andrew Melo] 14:35:51
Definitely not 2024, but maybe on the 2025-26 timescale.

      [Andrew Melo] 14:36:03
      I think that, like

      [Andrew Melo] 14:36:04
I think that with the benchmarks coming around we can actually, you know,

      [Andrew Melo] 14:36:10
say and quantify what these machines are; and with the, I guess, political idea that we're going to reopen the conversation on the MoU, I think this is something that we can hopefully get done in the short

      [Andrew Melo] 14:36:25
      term

      [Enrico Fermi Institute] 14:36:27
      Okay.

      [Enrico Fermi Institute] 14:36:27
      okay.

      [Enrico Fermi Institute] 14:36:30
Okay, Simone, go ahead.

      [simonecampana] 14:36:35
Yes, I think there is a bit of confusion, first of all on the latest topic.

      [simonecampana] 14:36:41
If you read the MoU, there is nothing written there that says that an HPC

      [simonecampana] 14:36:46
cannot be used as a pledged resource, as simple as that, so one doesn't have to

      [simonecampana] 14:36:50
rediscuss the MoU to discuss this. There are HPCs that have been part of the pledges for at least a decade and a half; in the Nordic countries the Tier-1 provides resources also partially through time on an HPC. So the reality is that the MoU tells

      [simonecampana] 14:37:10
you the basic principles of what can be considered a pledged resource: it has to be something with a certain amount of availability,

      [simonecampana] 14:37:18
the availability needs to be accounted for, you need to be able to send a ticket to it, and that's what it says.

      [simonecampana] 14:37:22
So I think that, in terms of policy, we don't need a major discussion and a rewrite of the MoU.

      [simonecampana] 14:37:35
The work can start today. I think there is something technical to be done, because a lot of what I just mentioned

      [simonecampana] 14:37:40
may be a technical detail, but someone still has to do the work of integrating the facility properly.

      [Enrico Fermi Institute] 14:37:49
      But but

      [simonecampana] 14:37:50
The other thing is the comment I made this morning: when you try to define a facility that works for one use case, at which granularity do you want to do it? If it

      [simonecampana] 14:38:06
is Monte Carlo versus data processing, fine.

      [simonecampana] 14:38:09
If it is a specific kind of Monte Carlo, a bit less fine. If it is only event generation, because it's the only one that doesn't need an input, it starts becoming really fine-grained. And those of you who participated in the discussions at the RRB know

      [simonecampana] 14:38:25
all the process that has to do with resource requests, etc.

      [simonecampana] 14:38:31
This becomes very complicated very quickly. So in the end the risk is that we do a lot of work to pledge HPCs for a benefit that is not particularly measurable.

      [Enrico Fermi Institute] 14:38:38
      Yeah.

      [simonecampana] 14:38:46
I think we are confusing the bookkeeping and the work that those HPCs are doing. And pledging should be done with the idea that those HPCs are multi-purpose facilities, which today many of them are not. If you try, for

      [simonecampana] 14:39:03
example, to discuss with Aurora today, there is not a lot you can do there unless you can use all those GPUs.

      [simonecampana] 14:39:09
So is that a multi-purpose facility? Today it is not. So

      [simonecampana] 14:39:11
I think there is a bit of confusion around what is policy,

      [Enrico Fermi Institute] 14:39:14
      Okay.

      [simonecampana] 14:39:16
      What is practical, and what needs technical work to be done.

      [simonecampana] 14:39:20
So I think this needs to be organized a bit.

      [Enrico Fermi Institute] 14:39:25
But even at the policy level, the one example you gave is something that, maybe I should use the words non-WLCG resource, or something like this.

      [Enrico Fermi Institute] 14:39:35
But the idea of reliability on something where you're not going to use it

      [Enrico Fermi Institute] 14:39:39
nine months of the year and then you're going to get a burst of, you know, 200,000 cores:

      [Enrico Fermi Institute] 14:39:48
policy-wise, I'm not sure that has any translation.

      [Enrico Fermi Institute] 14:39:51
I mean that, for the sorts of resources we're talking about here,

      [Enrico Fermi Institute] 14:39:55
it doesn't fit within the policy framework. That's my concern.

      [Enrico Fermi Institute] 14:40:01
If the policy is that it needs to be up 90% of the time and you need access to a certain base load,

      [Enrico Fermi Institute] 14:40:09
a burst once a year is not how these things work. So that's why I was saying that we really do need the policy work here as well.

      [simonecampana] 14:40:19
A little bit, but the reality is that a lot of what we care about is that 90% of your jobs don't fail when you end up there. And whether this is an HPC

      [simonecampana] 14:40:29
or a grid site, I'm sorry, it's a useful thing to ask, right?

      [Enrico Fermi Institute] 14:40:36
Yeah, you know, in much the same way that you have, in the power ecosystem, base load and variable demand,

      [Enrico Fermi Institute] 14:40:47
I think we need to have some more fundamental ideas in the policy framework.

      [Enrico Fermi Institute] 14:40:54
You know, right now our power grid is built from coal, and only coal, and we say that wind can't possibly be accounted for; and yet both of course have been successful.

      [simonecampana] 14:40:59
      yeah.

      [simonecampana] 14:41:04
      I just

      [simonecampana] 14:41:07
I understand, Brian, but you realize that the discussion on availability is not the one that today is stopping an HPC from being a pledged resource,

      [simonecampana] 14:41:14
right?

      [Enrico Fermi Institute] 14:41:16
Let's take a couple more quick comments, and then we can have more discussion about pledging on Wednesday; we have a dedicated session. Andrew, do you have a quick comment?

      [Andrew Melo] 14:41:26
Sorry, my hand was still up, but I'll just quickly point out

      [Andrew Melo] 14:41:32
that we can't do this today. It's not that the pledging statutes say you can't use HPCs in pledging;

      [Andrew Melo] 14:41:41
it's just that the rules that are set around pledging, around how you pledge

      [Andrew Melo] 14:41:45
resources, basically mean you can't do it. It's not that there's an explicit prohibition of it;

      [Andrew Melo] 14:41:52
you just simply can't do it.

      [Enrico Fermi Institute] 14:41:54
      yeah.

      [Enrico Fermi Institute] 14:41:55
      Yeah.

      [simonecampana] 14:41:56
I just don't understand this, but fine, I'll let it go.

      [simonecampana] 14:41:59
I mean, there are other places where they pledge

      [simonecampana] 14:42:02
HPCs, so it can be done somehow.

      [Enrico Fermi Institute] 14:42:02
Yeah, but they basically put a grid site on top of it,

      [simonecampana] 14:42:07
      Well, then, yeah, you have to do some work. Yes, I agree.

      [Enrico Fermi Institute] 14:42:07
with all the rules. No, but the problem here is:

      [simonecampana] 14:42:10
      Yeah.

      [Enrico Fermi Institute] 14:42:12
it means that you would have to influence the scheduling of the HPC

      [Enrico Fermi Institute] 14:42:18
facility. The HPC facility itself would have to internally adjust its scheduling policy to match the grid model, at least for a fraction of the site. And that's just not how things are done in the US: we are a customer.

      [Enrico Fermi Institute] 14:42:33
      We don't tell them how they do their scheduling.

      [Andrew Melo] 14:42:35
Okay, or let me give another example, and I don't know, like, the inside of it.

      [Enrico Fermi Institute] 14:42:35
      We use the resources as they give them to us

      [Andrew Melo] 14:42:41
But, you know, let's say that we're now using Amazon for CMS jobs.

      [Andrew Melo] 14:42:46
We can't send site availability, you know, we can't send SAM tests to Amazon right now. So whatever resources Amazon is going to give don't show up in the monitoring. Now, it shouldn't be that way, but that's how it

      [Andrew Melo] 14:43:02
      is.

      [Enrico Fermi Institute] 14:43:05
Let's take a comment from Ian, and then let's move on.

      [Ian Fisk] 14:43:07
My comment was: as I understood, this is a blueprint meeting, and a blueprint is typically the design for something that you're going to build in the future, which means that I think we need to be a little bit careful when we talk about

      [Steven Timm] 14:43:07
      good.

      [Ian Fisk] 14:43:19
the reality of right now and the limitations that we face right now, and try to see a little bit farther ahead,

      [Ian Fisk] 14:43:26
to the times when some of those limitations will not be there. And so if we want to talk about pledging, maybe we need to define it

      [Ian Fisk] 14:43:32
in such a way that it's maybe the ability to run all workflows, or the ability to run some subset of workflows.

      [Ian Fisk] 14:43:41
But I think we do ourselves a disservice

      [Ian Fisk] 14:43:43
if we expect that nothing's going to change, because I think we will, as a field, along with the rest of science, figure out how to use these machines, and we will figure out how to use clouds.

      [Ian Fisk] 14:43:57
And we need to plan for our own success,

      [Ian Fisk] 14:43:59
      I think

      [Enrico Fermi Institute] 14:44:05
That's a great point.

      [Enrico Fermi Institute] 14:44:08
Yeah, we already talked quite a bit about the second point.

      [Enrico Fermi Institute] 14:44:13
I just wanted to go into it a little bit, because there's one thing that hasn't been brought up yet:

      [Enrico Fermi Institute] 14:44:20
basically how we deal with larger architecture changes.

      [Enrico Fermi Institute] 14:44:24
We went into that quite a bit already; we've already seen this.

      [Enrico Fermi Institute] 14:44:29
Today we see multiple GPU architectures. The early porting efforts to GPU focused on NVIDIA, because that's what everyone is using, to a large extent.

      [Enrico Fermi Institute] 14:44:40
That's still what everyone is using. But if you look at what the LCFs

      [Enrico Fermi Institute] 14:44:43
are deploying: Frontier has AMD, Aurora will have Intel.

      [Enrico Fermi Institute] 14:44:52
So what are we doing there? And then the next generation might have some weird FPGA AI accelerator,

      [Enrico Fermi Institute] 14:44:58
who knows? I know that the framework groups, and this is outside the scope here, are looking at performance-portability solutions.

      [Enrico Fermi Institute] 14:45:06
So far it looks like yes, you can run everywhere, but you take a severe performance hit.

      [Enrico Fermi Institute] 14:45:11
Is that acceptable? That's an open topic here, but that's the only alternative. If that's not enough,

      [Enrico Fermi Institute] 14:45:20
and if this doesn't work, then you kind of have to limit what you can target.

      [Taylor Childers] 14:45:26
Sure, can I push back on that? You know, the PPS group in HEP-CCE has shown that you can use these frameworks, and sure, you're going to take a performance hit.

      [Taylor Childers] 14:45:38
But I would argue 10% is not something that is worth the effort.

      [Enrico Fermi Institute] 14:45:41
      Okay.

      [Enrico Fermi Institute] 14:45:45
That's why there is a question mark, because maybe it is acceptable.

      [Taylor Childers] 14:45:45
Especially in the MadGraph case, right?

      [Taylor Childers] 14:45:50
I mean, we're running MadGraph with native CUDA, SYCL, Kokkos, Alpaka, and sure, CUDA outperforms.

      [Taylor Childers] 14:46:02
But the amount of work that has gone into the CUDA to get another 10%? It's just not worth it.
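
To make concrete what running the same physics kernel through CUDA, SYCL, Kokkos, or Alpaka implies for the code, here is a minimal Kokkos sketch (not taken from the actual MadGraph port): a single kernel source that compiles against a CUDA, HIP, SYCL, or CPU backend chosen at build time, which is exactly the portability-versus-hand-tuned-CUDA trade being discussed.

    // axpy.cpp -- minimal Kokkos performance-portability sketch
    #include <Kokkos_Core.hpp>

    int main(int argc, char* argv[]) {
      Kokkos::initialize(argc, argv);
      {
        const int N = 1 << 20;
        Kokkos::View<double*> x("x", N), y("y", N);  // lives in the backend's memory space

        // Initialize on the device; the lambda is compiled for whichever backend was chosen at build time.
        Kokkos::parallel_for("init", N, KOKKOS_LAMBDA(const int i) {
          x(i) = 1.0;
          y(i) = 2.0;
        });

        // The same axpy kernel runs unchanged on NVIDIA, AMD, or Intel GPUs, or on CPU threads.
        const double a = 0.5;
        Kokkos::parallel_for("axpy", N, KOKKOS_LAMBDA(const int i) {
          y(i) += a * x(i);
        });
        Kokkos::fence();  // wait for the asynchronous kernels to finish
      }
      Kokkos::finalize();
      return 0;
    }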

      [Enrico Fermi Institute] 14:46:11
Because I think, given what we have to do,

      [Enrico Fermi Institute] 14:46:17
and I know this is outside the scope of the workshop, but it impacts what we can plan with, the only two options are either performance portability, or we just don't target a certain architecture. Because we cannot, every 5 years, when an LCF decides they want the newest greatest and best

      [Enrico Fermi Institute] 14:46:36
accelerator chip, just refactor our whole software stack. It's just not feasible.

      [Enrico Fermi Institute] 14:46:44
      So

      [Enrico Fermi Institute] 14:46:48
Okay. And then in terms of strategic considerations: just because we managed to be able to use this generation's LCFs

      [Enrico Fermi Institute] 14:46:59
doesn't really guarantee that we can use the next. So we need to keep that in mind when we do the long-term planning, because there might come a point where the amount of HPC deployment that is usable for us goes down, and we need to shift that

      [Enrico Fermi Institute] 14:47:15
capacity somewhere else.

      [Enrico Fermi Institute] 14:47:21
Does anyone else have any other comment or concern,

      [Enrico Fermi Institute] 14:47:27
strategically, about going all in, making the jump, as Paolo said,

      [Enrico Fermi Institute] 14:47:32
on the HPC side, where we could miss the jump?

      [Enrico Fermi Institute] 14:47:39
In terms of making the jump: I mean, we can sort of hedge our bets a little bit with that, right? We don't have to make the jump with 100%

      [Enrico Fermi Institute] 14:47:52
of our computing. I mentioned that you don't jump in one go

      [Enrico Fermi Institute] 14:48:01
to the top: you make a small jump, you see where you are, and you make another jump.

      [Enrico Fermi Institute] 14:48:07
It's a gradual process.

      [Paolo Calafiura (he)] 14:48:09
One thing I want to say, which I've heard from a reliable source in some community with multiple jumps, is that the first jump is the worst one.

      [Enrico Fermi Institute] 14:48:10
      Yeah.

      [Paolo Calafiura (he)] 14:48:22
The second, the third, and the fourth are increasingly easier; the more you go from one architecture to the other, the less you have to refactor your code to go to the next one.

      [Enrico Fermi Institute] 14:48:40
      Yeah, I didn't even mention it here, because I don't think it's a big problem.

      [Enrico Fermi Institute] 14:48:44
The multiple CPU architectures: at least I don't see a big issue there on the CMS side.

      [Enrico Fermi Institute] 14:48:50
That's usually just a recompile and a revalidation.

      [Enrico Fermi Institute] 14:48:55
The jump to GPU, though, I'm just not...

      [Paolo Calafiura (he)] 14:48:58
No, what I'm saying is that once you jump to GPU, or to, let's say, a parallelization layer, whatever it is, that is a very painful jump.

      [Paolo Calafiura (he)] 14:49:09
But once you have done that jump, going from one GPU to another, or from one GPU to some so-far-unknown architecture that, you know, crunches matrix multiplications, like JAX for example, going to JAX may be less painful than the first

      [Enrico Fermi Institute] 14:49:11
      Just

      [Paolo Calafiura (he)] 14:49:27
one. That's what I was trying to say.

      [Enrico Fermi Institute] 14:49:35
Okay, we'll move on. I think we have some presentations next. Do we want to say anything more on this? I don't think so.

      [Enrico Fermi Institute] 14:49:47
On the security model: we'll talk about the security model later.

      [Enrico Fermi Institute] 14:49:48
Yeah. So, Andrej, are you connected?

      [Enrico Fermi Institute] 14:49:55
      Do you want to share? Yeah.

      [Andrej Filipcic] 14:49:56
Let me share my screen. Can you hear me? Right?

      [Andrej Filipcic] 14:49:59
Okay.

      [Enrico Fermi Institute] 14:50:02
Great. So we want to show a little bit of what's going on on the European side.

      [Andrej Filipcic] 14:50:04
      Just

      [Enrico Fermi Institute] 14:50:08
Yeah, then we can take it as an overview.

      [Andrej Filipcic] 14:50:09
Right. So, just a bunch of slides, but let me know if you are interested in anything else

      [Enrico Fermi Institute] 14:50:12
      Yeah.

      [Andrej Filipcic] 14:50:18
or in some specifics over here; maybe it's a bit too generic.

      [Andrej Filipcic] 14:50:21
So the EuroHPC Joint Undertaking is, let's say, a consortium of 31 states, which are called out here on the right side:

      [Andrej Filipcic] 14:50:34
basically all of Europe, plus Turkey, apart from the UK and Switzerland.

      [Andrej Filipcic] 14:50:39
And in the first phase, which ended last year, there were 8 machines funded.

      [Enrico Fermi Institute] 14:50:43
      Okay.

      [Andrej Filipcic] 14:50:46
So, 3 pre-exascale machines in the range of 250 to 350 petaflops:

      [Andrej Filipcic] 14:50:51
those are Lumi in Finland, Leonardo, which will be inaugurated in November in Italy, and MareNostrum, which will be a bit later.

      [Andrej Filipcic] 14:51:02
Its procurement just finished, but there are not many details on this machine yet, apart from the talk today: it will have a quite large CPU partition of 30 petaflops,

      [Andrej Filipcic] 14:51:14
which is quite good for us, let's say. The second phase is the 6 years up to 2027, and among the currently approved machines, the high-range one, the exascale machine, will be Jupiter. The machine was just approved, but the procurement was

      [Andrej Filipcic] 14:51:35
not yet done, so no details on this machine, just the plans. Basically they

      [Andrej Filipcic] 14:51:42
want to reach one exaflop, more or less.

      [Andrej Filipcic] 14:51:47
And there will be 4 mid-range ones, so 4 HPCs with investments between 20 and 40 million euros each, and those will be in Greece,

      [Andrej Filipcic] 14:52:02
Hungary, and Ireland, I think. Also there will be some co-located quantum computers,

      [Andrej Filipcic] 14:52:12
the first generation, and this will be approved probably next month.

      [Andrej Filipcic] 14:52:18
So this is just the mission, which you can read later on.

      [Andrej Filipcic] 14:52:24
Basically EuroHPC wants to support leadership supercomputing, including quantum computing and all the data infrastructure around it.

      [Andrej Filipcic] 14:52:35
Then they want to develop their own hardware, and they want to involve industry a lot;

      [Andrej Filipcic] 14:52:42
those are, let's say, the two bullets. So the budget is roughly 50% from the European Commission and 50% from the hosting states,

      [Enrico Fermi Institute] 14:52:46
      Okay.

      [Andrej Filipcic] 14:52:55
that is, the countries that decide to build the HPC,

      [Andrej Filipcic] 14:52:58
although for the smaller machines the European Commission only funds 35%.

      [Andrej Filipcic] 14:53:04
In phase one, about 1 billion euros was spent; for phase two, 7 to 8 billion is actually foreseen. In the table in the picture you have a detailed breakdown from the European Commission, and then there would be the same matching contribution from

      [Enrico Fermi Institute] 14:53:11
      Okay.

      [Andrej Filipcic] 14:53:25
all the Member States. Also, let's say, 200 million is meant for hyperconnectivity,

      [Andrej Filipcic] 14:53:33
so for the Terabit network, and 50% of the money is spent on newly procured infrastructure.

      [Andrej Filipcic] 14:53:42
There are many projects and R&D activities going on around it.

      [Andrej Filipcic] 14:53:45
Maybe one important one is EuroCC, the European Competence Centres, which is basically a very large project with about 30 participating

      [Andrej Filipcic] 14:54:01
states, let's say, most of them, and the funding is about 1 million euros per country per year.

      [Andrej Filipcic] 14:54:07
The goals are basically training, connection with industry, and collecting

      [Andrej Filipcic] 14:54:14
the knowledge on HPC, whatever that means. There are also Centres of Excellence, for example, which are mostly dedicated to, let's say, supporting software development or scalability extensions for particular groups.

      [Andrej Filipcic] 14:54:28
They can be dedicated to a particular field of science, like chemistry or molecular dynamics, or something like that, or they can be a bit wider in scope, for specific,

      [Andrej Filipcic] 14:54:38
let's say, data handling for a use case like ours, something like that.

      [Andrej Filipcic] 14:54:44
There are about 10 of them, initially funded at between 6 to 8 million per project, and those calls will be continuing all the time

      [Andrej Filipcic] 14:54:53
through this period. There are 2 bodies, the Research and Innovation Advisory Group and the Infrastructure Advisory Group, which basically form recommendations for the evolution and development and so forth; basically for everything: for the research calls, for funding, and for infrastructure deployment.

      [Andrej Filipcic] 14:55:16
Another part of it is the European Processor Initiative, with the aim to build a European CPU and GPU;

      [Andrej Filipcic] 14:55:24
maybe I'll say a bit more on that later.

      [Andrej Filipcic] 14:55:30
There's also EUMaster4HPC, which is just a common university programme.

      [Andrej Filipcic] 14:55:36
This is a project where many countries and universities,

      [Andrej Filipcic] 14:55:42
let's say about 30 of them, will try to put up HPC

      [Andrej Filipcic] 14:55:47
studies, a master's programme, typically in sync, and share students, share lectures, and so on.

      [Andrej Filipcic] 14:55:57
There are about 30 projects altogether. The resource allocation access is only provided to EU-type users,

      [Andrej Filipcic] 14:56:07
so basically to members of the European Union, the extended one. Actually, the European Commission share is managed very similarly to PRACE before:

      [Andrej Filipcic] 14:56:19
PRACE-like calls for applications, with some changes. The first one is development and benchmark access,

      [Andrej Filipcic] 14:56:26
with basically immediate access, so let's say within less than a month, maybe even, we think, 2 weeks. And this is not negligible in resources:

      [Enrico Fermi Institute] 14:56:31
      Okay, this.

      [Andrej Filipcic] 14:56:37
you can get something like up to half a million CPU hours

      [Andrej Filipcic] 14:56:43
for this access, and you get it for up to a year.

      [Andrej Filipcic] 14:56:48
Then there's the regular access, which is a couple of tens of millions of CPU hours;

      [Andrej Filipcic] 14:56:53
it is peer reviewed. And there will also be calls later on for industry and the public sector;

      [Andrej Filipcic] 14:57:00
this is not yet finalized, because of the funding issues

      [Andrej Filipcic] 14:57:05
and, let's say, charging for industry. Then the hosting entity share:

      [Andrej Filipcic] 14:57:11
for the owner, the host of the HPC,

      [Andrej Filipcic] 14:57:14
the country, the policies there are completely regulated by country policies or decisions,

      [Andrej Filipcic] 14:57:21
so each state can do whatever they want with their share.

      [Andrej Filipcic] 14:57:28
So overall, the design of some of the HPCs

      [Andrej Filipcic] 14:57:37
is quite classical, but not all of them are really classical

      [Andrej Filipcic] 14:57:40
HPCs anymore. As you know, Vega in Slovenia was designed with heavy-duty data processing and outbound connectivity in mind, which actually works pretty well

      [Andrej Filipcic] 14:57:54
for ATLAS, where Vega contributed something between 13 and 40% of the CPU

      [Andrej Filipcic] 14:58:01
during the last year, let's say. Then the second one, Lumi:

      [Andrej Filipcic] 14:58:03
they have a very large dedicated partition for visualization and services, and they will provide Ceph object storage for long-term data preservation

      [Andrej Filipcic] 14:58:16
and so on, and they want to provide all the modern tools.

      [Enrico Fermi Institute] 14:58:19
      Okay.

      [Andrej Filipcic] 14:58:20
MareNostrum 5 I mentioned already: it has not been built yet, but they said it will have a much larger CPU partition and open access, because the government decided that this machine needs to support that kind of use.

      [Andrej Filipcic] 14:58:35
So much for the overall architecture. Most of these machines are general purpose,

      [Andrej Filipcic] 14:58:46
some maybe less general purpose than others, but basically all of them need to adapt to the user needs.

      [Andrej Filipcic] 14:58:54
So it's a bit different: they're not completely free to set the policies on

      [Andrej Filipcic] 14:59:00
how these machines will be set up and what services they provide, because overall the EuroHPC

      [Andrej Filipcic] 14:59:07
Governing Board, which has representatives from the states, can say what to do with these machines.

      [Andrej Filipcic] 14:59:16
Right. And there are many countries that participate in these calls but don't have an HPC,

      [Andrej Filipcic] 14:59:23
and they would like to use it, basically for all the science.

      [Andrej Filipcic] 14:59:27
So that's also interesting. The current machines are a mixture of CPU and GPU partitions.

      [Andrej Filipcic] 14:59:37
The CPUs are mostly AMD, then some Intel; Leonardo, for example, will be Intel.

      [Andrej Filipcic] 14:59:45
Then there is one ARM machine that will be in Portugal, based on Fujitsu. They have both NVIDIA and AMD, but most have NVIDIA GPUs, and some, like Lumi, only have AMD. So Lumi

      [Enrico Fermi Institute] 15:00:03
      Okay.

      [Andrej Filipcic] 15:00:05
is the same as, what's the name, the OLCF machine.

      [Andrej Filipcic] 15:00:08
Right. But in any case most nodes have GPUs;

      [Andrej Filipcic] 15:00:14
most of the hardware is GPU, comprising between 60 to 80%,

      [Andrej Filipcic] 15:00:20
depending on the machine. Well, there's one small machine that's CPU only, but all the big machines have,

      [Andrej Filipcic] 15:00:25
let's say, 20 to 40% CPU nodes, in node count, not

      [Andrej Filipcic] 15:00:31
computing power. The storage is typically Lustre, with Ceph, and some also provide another kind of file system;

      [Andrej Filipcic] 15:00:40
that one is less popular. And most of these machines, basically apart from Lumi and Karolina, which is in the Czech Republic, were built by Atos. As for the future machines:

      [Andrej Filipcic] 15:00:59
the large one, the next exascale machine, which will be built in France, will most definitely be ARM based, so it would be ARM CPU plus GPU as well.

      [Andrej Filipcic] 15:01:10
Details are not clear yet; the goal is to build it somewhere in 2024-25. And after that, the next one,

      [Andrej Filipcic] 15:01:20
let's say the next exascale machine, whatever it will be:

      [Andrej Filipcic] 15:01:24
they have strong wishes, let's say, for now, that it should be RISC-V based. Next slide.

      [Andrej Filipcic] 15:01:33
      So some thoughts, some observation. After 1 point, 5 years of operations of these machines, so each of them have of the order of something like 500 users, which might seem a bit little, for some but in actually most of these users, are completely newcomers, since many other users already have allocations on

      [Andrej Filipcic] 15:01:57
the large existing machines, let's say in Italy, Spain, Germany, or France,

      [Andrej Filipcic] 15:02:02
the ones that are part of PRACE. And these users have really a lot of different kinds of workloads: many-node compute jobs, both on CPU and on GPU; the majority, I'd say, is chemistry or materials

      [Andrej Filipcic] 15:02:21
science, although, at least on Vega, there are something like 30 different applications that the users want to run. A lot of users also do small-node or small-core parameter scans with tons of independent jobs, let's say. And many, many users in the last

      [Andrej Filipcic] 15:02:43
year started to use machine learning; even, let's say, data analysis with machine learning, and this is actually growing rapidly,

      [Andrej Filipcic] 15:02:51
because, let's say, it's quite simple with TensorFlow and all these kits

      [Andrej Filipcic] 15:02:55
to allocate them around on the machine. At least on Vega we have really big pressure on GPUs, so the next machine we buy will have a much larger GPU partition. Some users also do extreme data processing;

      [Enrico Fermi Institute] 15:03:08
      Okay.

      [Andrej Filipcic] 15:03:14
no, I don't mean AI here,

      [Andrej Filipcic] 15:03:17
but, for example, things like cryo-microscopy or similar, where they produce, let's say, a couple of tens of terabytes per measurement that they want to process, sometimes interactively. Some HPCs allocate full nodes only, but there are many that can run any type

      [Andrej Filipcic] 15:03:32
of jobs. Also, we have observed, in my experience, that many users are not quite happy with the default data organization of the HPCs, which basically more or less doesn't exist,

      [Andrej Filipcic] 15:03:45
I would say, although we may have better tools in the future. But let's say, within EuroHPC,

      [Andrej Filipcic] 15:03:51
data migration and movement were not yet discussed. And many users stick to containers, and some demand even

      [Andrej Filipcic] 15:04:01
virtualization. Basically, with EuroHPC, what the user demands to use should basically be provided, sooner or later.

      [Andrej Filipcic] 15:04:11
There are many more users on EuroHPC at this point than there ever were in PRACE,

      [Andrej Filipcic] 15:04:16
so this number will probably grow, cumulatively, to 50,000 pretty soon on all the machines. And there are really a lot of newcomers, due to the simplicity of access: you basically just submit a proposal, not even a

      [Andrej Filipcic] 15:04:33
proposal, an application with a quick description, and you will get access within less than a month.

      [Andrej Filipcic] 15:04:39
The usage by industry is rising a bit;

      [Andrej Filipcic] 15:04:43
this is mostly small or medium enterprises, but it is still not extremely high.

      [Andrej Filipcic] 15:04:50
Let's say, more or less, industry is entitled to use 20% of the HPCs by European law,

      [Andrej Filipcic] 15:04:57
or let's say by European funding regulations, but they're not yet at 20,

      [Andrej Filipcic] 15:05:06
far from 20% of usage at this point, although some HPCs, like the one in Luxembourg, were built entirely to support industry.

      [Andrej Filipcic] 15:05:15
Several countries have also decided to provide resources through EuroHPC:

      [Andrej Filipcic] 15:05:20
let's see, Slovenia for sure, because I know what's going on there;

      [Andrej Filipcic] 15:05:25
you hear the same message from Spain and others;

      [Andrej Filipcic] 15:05:28
and lately even in Germany: I think the Germans want to keep only DESY and KIT. Not sure if this is official yet, but other countries will probably follow a similar way.

      [Enrico Fermi Institute] 15:05:36
      question.

      [Andrej Filipcic] 15:05:40
So, and let's see, the European... Sorry, yes, go ahead.

      [Enrico Fermi Institute] 15:05:42
A question:

      [Enrico Fermi Institute] 15:05:47
so you said several countries have already decided, you know, like Slovenia with the highly successful Vega.

      [Enrico Fermi Institute] 15:05:53
What about the Vega design makes it so much easier to integrate than, let's say, some of the US

      [Enrico Fermi Institute] 15:06:06
snowflakes?

      [Andrej Filipcic] 15:06:09
Mostly because at Vega we made the decision to support these services.

      [Andrej Filipcic] 15:06:14
Some other, more classical HPCs are hesitant in this respect.

      [Andrej Filipcic] 15:06:20
But let's say Vega is not so different in hardware

      [Andrej Filipcic] 15:06:23
architecture from the others, apart from the fact that we really required a large pipe; at the moment this pipe can do 600 gigabits per second

      [Andrej Filipcic] 15:06:35
to GÉANT. And this will increase in the future.

      [Andrej Filipcic] 15:06:40
So it's mostly a matter of deciding what you allow users to do over there.

      [Enrico Fermi Institute] 15:06:49
      Okay.

      [Andrej Filipcic] 15:06:51
The network connectivity will likely improve a lot in the next 2 to 3 years, let's say, especially if

      [Andrej Filipcic] 15:07:01
the Terabit network vision happens; there are still some open questions about the funding scheme, and who can do the networking, and so on.

      [Andrej Filipcic] 15:07:14
Long-term data storage is not really part of the plans, so it's a bit in the wild, but there's high pressure from many communities to use this as well. Right?

      [Andrej Filipcic] 15:07:25
So the HPCs at this point are not obliged to provide long-term storage;

      [Andrej Filipcic] 15:07:29
let's say when the HPC is decommissioned, the storage is likely to be decommissioned as well, and new storage will be brought up with the new machine, right?

      [Andrej Filipcic] 15:07:39
But this will need to change in the future. One thing worth stressing is that there are some flagship European projects, like Destination Earth.

      [Andrej Filipcic] 15:07:49
I'm not sure you know it, but Destination Earth basically couples

      [Andrej Filipcic] 15:07:54
ECMWF, the weather agency, and EUMETSAT.

      [Andrej Filipcic] 15:07:59
The aim is to provide a digital twin of the Earth, right?

      [Enrico Fermi Institute] 15:07:59
      Hmm.

      [Andrej Filipcic] 15:08:06
That includes satellite imaging, weather data collection,

      [Andrej Filipcic] 15:08:12
weather forecasting, and so on: basically a global model for Earth predictions, and so on.

      [Andrej Filipcic] 15:08:23
Basically, it's a huge project. And this organization already officially asked the Joint Undertaking if they could use EuroHPC at the production

      [Andrej Filipcic] 15:08:33
level, and basically the Joint Undertaking agreed: for now they can use 10% of all the resources,

      [Andrej Filipcic] 15:08:43
the European Commission share, but up to 10%. And more organizations may follow this way; for example, Destination Earth doesn't have enough funding or money to do anything without EuroHPC

      [Andrej Filipcic] 15:08:57
at this point. So more projects like this will follow, and maybe even, let's see...

      [Andrej Filipcic] 15:09:07
But this was not discussed yet. I will skip the next slides, because they are just a bit of an overview of the computers.

      [Enrico Fermi Institute] 15:09:10
      Okay.

      [Andrej Filipcic] 15:09:15
You will see them later on when I upload them.

      [Andrej Filipcic] 15:09:20
      Okay, That's it.

      [Enrico Fermi Institute] 15:09:23
Great, thank you. We have some raised hands. Paolo, his hand has been up for a while; go ahead.

      [Paolo Calafiura (he)] 15:09:30
Yeah. I saw one slide in which you mentioned that short term, let's say, the next generation would be ARM, and the next-next generation may be RISC-V. And I'm wondering if you meant that as a CPU replacement, and therefore

      [Andrej Filipcic] 15:09:46
      Right.

      [Paolo Calafiura (he)] 15:09:53
also having accelerators. Or are you just saying it will be

      [Paolo Calafiura (he)] 15:09:57
more pure RISC-V?

      [Andrej Filipcic] 15:09:58
No, the ARM machine has accelerators.

      [Paolo Calafiura (he)] 15:10:00
Okay, okay. So like, something like...

      [Andrej Filipcic] 15:10:05
Yeah, something like that. It's not clear whether it will be Grace Hopper style, or separate chips, or whatever.

      [Paolo Calafiura (he)] 15:10:12
      Okay, okay.

      [Enrico Fermi Institute] 15:10:16
Okay, there's a hand up. Ian?

      [Ian Fisk] 15:10:18
Yeah, my questions were actually two. One was: the Destination Earth project,

      [Ian Fisk] 15:10:24
is that a strategic alliance between EuroHPC

      [Ian Fisk] 15:10:26
and the project? And is it multi-year? Is it different from the typical peer review?

      [Ian Fisk] 15:10:31
      Oh!

      [Andrej Filipcic] 15:10:31
Yes, it's completely different, because this is a long-term project for at least 10 years, and even more,

      [Andrej Filipcic] 15:10:41
let's say.

      [Ian Fisk] 15:10:42
Okay. But does that mean the door is open to other multi-year things like that,

      [Ian Fisk] 15:10:50
like WLCG or the LHC negotiating such an arrangement?

      [Andrej Filipcic] 15:10:53
I think so. I mean, the thing is that the European Commission needs to find such projects interesting to support. And actually those projects are typically listed in the ESFRI list, where, for example, High-Luminosity LHC already is, right?

      [Ian Fisk] 15:11:10
Okay. Was there... I may have missed it.

      [simonecampana] 15:11:13
      I think

      [Ian Fisk] 15:11:17
But is there a second exascale machine in France someplace?

      [Ian Fisk] 15:11:22
I thought the only one was in Germany. Just to understand.

      [Andrej Filipcic] 15:11:23
The official one that was accepted already, the approved one, is in Germany: Jupiter. France will likely come next year;

      [Enrico Fermi Institute] 15:11:30
      Okay.

      [Ian Fisk] 15:11:32
Okay, nice. Okay.

      [Andrej Filipcic] 15:11:33
I mean, the call for proposals.

      [Ian Fisk] 15:11:39
      Thanks.

      [Enrico Fermi Institute] 15:11:41
And a comment from Maria.

      [Maria Girone] 15:11:43
Maybe I just want to say that, as a research infrastructure, there are a lot of ongoing discussions,

      [Maria Girone] 15:12:01
Simone knows well, between, let's say, the larger communities and EuroHPC,

      [Maria Girone] 15:12:09
in order to try to motivate further collaborations, very much like those programs like Destination Earth, which indeed is a priority for the European Commission.

      [Maria Girone] 15:12:21
But we also have a number of projects now that will allow us to do

      [Maria Girone] 15:12:28
R&D. For instance,

      [Maria Girone] 15:12:35
we will present tomorrow what we're doing with the Jülich Supercomputing Centre for what concerns the development and use of GPU resources at scale for distributed training. There is in the pipeline also a European project that will allow us to evaluate open-source

      [Maria Girone] 15:13:00
solutions like RISC-V and SYCL. So there are a number of opportunities, and, as Andrej said very well, it is very easy,

      [Maria Girone] 15:13:09
actually, to work on the development side with EuroHPC.

      [Maria Girone] 15:13:13
And we get granted resources for developers

      [Maria Girone] 15:13:20
even within 5 days, I mean, within a working week. So it's a very, very nice collaboration, at least at this level. We need to build on this and go further, and that is less obvious

      [Maria Girone] 15:13:33
and will require some common actions, let's say, at least when we are talking to EuroHPC.

      [Enrico Fermi Institute] 15:13:47
Simone was first. Okay.

      [simonecampana] 15:13:49
It's a follow-up on Ian's question. I think one of the prerequisites for entering one of those special programs, like, I don't know,

      [simonecampana] 15:14:02
I don't remember what they're called, but the ones that are not grant-based but more long-term,

      [simonecampana] 15:14:06
is, first, that you are an impactful science. And of course, you know, it's really hard to define what is impactful,

      [simonecampana] 15:14:17
but of course the one who saves the world's health has a simpler way of demonstrating that impact. Also, if we want to apply for something like this, I think it's important to make a lot of progress in the software area,

      [simonecampana] 15:14:32
because one of the other things one has to demonstrate is that you use an HPC

      [simonecampana] 15:14:38
for the value of an HPC. And already we don't use much of the interconnects that an HPC offers.

      [simonecampana] 15:14:45
So if we are also weak on GPUs and the use of those architectures, then we become not such a great candidate for one of those programs.

      [simonecampana] 15:14:55
So I think we have to build our story, and we have some technical improvements at the software level that would allow us to build a better story. And my question is if there is something like this in the US,

      [simonecampana] 15:15:10
because if there was, then we could try to build an even more coherent story across

      [simonecampana] 15:15:16
Europe and the US.

      [simonecampana] 15:15:22
Do you have a notion of sciences that get into a program once, and then there is a multi-year engagement with the HPC facilities?

      [Ian Fisk] 15:15:31
Oh, I think the DOE folks might be able to answer better. But I think this is one of the things where

      [Ian Fisk] 15:15:37
there is definitely a push these days from the US

      [Ian Fisk] 15:15:40
funding agencies for science to make effective use of the large-scale computing facilities.

      [Ian Fisk] 15:15:48
And so, while there's not a program like EuroHPC,

      [Ian Fisk] 15:15:53
because it's only one country, it does mean that, if you look at where the national labs have made investments,

      [Ian Fisk] 15:16:03
a lot of the investments have been made in central facilities, with the expectation that the calculations are done

      [Ian Fisk] 15:16:07
there.

      [simonecampana] 15:16:08
      Right.

      [Enrico Fermi Institute] 15:16:09
Yeah, the thing is, at the moment there are very high-level discussions going on, because there's a push from the funding agencies that we should use more HPC. And it's not just us, it's in general: because they pay for these facilities, they want them

      [Enrico Fermi Institute] 15:16:25
to be used. But there now is pushback, and that's why the conversation is at a very high level.

      [Ian Fisk] 15:16:27
      Good.

      [Enrico Fermi Institute] 15:16:32
Liz mentioned that there are groups talking about what they need to do in terms of changing their policies to actually allow that, because the application process for the DD program, the ALCC application process, the INCITE application process, they're just not geared towards these use cases; they want competitive proposals that are unique, that can only be done there, which

      [Enrico Fermi Institute] 15:16:56
is just not a good match. And this is above our pay scale

      [Enrico Fermi Institute] 15:17:01
here. These conversations are going on; hopefully something comes out of it. We'll see.

      [Taylor Childers] 15:17:05
I'm not sure that that is the case. I mean, so:

      [Taylor Childers] 15:17:10
the INCITE program offers the opportunity to get up to 3 years of allocation through a competitive review process.

      [Taylor Childers] 15:17:19
The challenge is laying out your case, and I would argue that the way you approach this for the Leadership Computing Facilities is that you have to play to their mission, right?

      [Taylor Childers] 15:17:34
I mean, their mission is to provide the biggest computers because people need them, not because...

      [Enrico Fermi Institute] 15:17:40
But Taylor, this is... you basically have to sell it, and you have to sell it in a way that you basically dress it up as something that

      [Enrico Fermi Institute] 15:17:48
      You can only do there, and that's not what we want.

      [Taylor Childers] 15:17:52
I agree, but I would argue that you can easily make the case based on the fact that you are reaching a resource-constrained scenario, and if you don't get access to the machines, then you'll be slower in your science achievements. And I

      [Enrico Fermi Institute] 15:17:52
      Thanks.

      [Taylor Childers] 15:18:14
think that's a viable argument. I think the part where you guys have trouble, especially in an INCITE program proposal, is the fact that you don't have enough of the workloads that take advantage of GPUs, right?

      [Enrico Fermi Institute] 15:18:31
Yeah, that's probably... we're trying ALCC right now.

      [Taylor Childers] 15:18:31
I mean the challenge is...

      [Taylor Childers] 15:18:35
Yeah, for sure, and...

      [Enrico Fermi Institute] 15:18:35
That's easier to justify, I think. If we ever get to the point that we get, like, a huge INCITE proposal, you can make the science use case if you can do something you couldn't otherwise do, because it basically adds 50% of your own capacity, or

      [Enrico Fermi Institute] 15:18:50
whatever. But then the next problem kicks in: even with INCITE,

      [Enrico Fermi Institute] 15:18:55
you do an allocation proposal, you get the decision, and then you get the allocation a few months later. It's too short a timescale.

      [Enrico Fermi Institute] 15:19:05
You would basically have to ask a year or two in advance to fit our planning process within the experiment, because you can't just drop that on top of CMS

      [Enrico Fermi Institute] 15:19:14
and expect that we basically throw our plans out the window

      [Enrico Fermi Institute] 15:19:18
and now effectively use it.

      [Taylor Childers] 15:19:19
Yeah, no, there definitely need to be more discussions above our pay grade.

      [Taylor Childers] 15:19:25
I mean, the challenge there is that to some extent you have to change how the Leadership Computing Facilities are reviewed, so that we can accommodate stuff like that.

      [Enrico Fermi Institute] 15:19:44
So, I'm sorry: you mentioned that obviously these machines are mixed CPU-GPU.

      [Enrico Fermi Institute] 15:19:55
Next generation, will more of the flops and actual compute power be in the accelerator realm, or will there be some machines where ARM sort of provides the heavy lifting?

      [Andrej Filipcic] 15:20:14
Well, how to say, it's hard to predict. But in my opinion there will always be machines built in such a way that as many user communities as possible can use them,

      [Andrej Filipcic] 15:20:24
and several sites know that already, right? So nobody will go for a completely dedicated machine. For example, even Jupiter, which is exascale: it would be easier to build it up and reach the highest Top500 number by going GPU-only, right? But they don't

      [Andrej Filipcic] 15:20:45
want that; I mean, nobody would actually want that. So, on CPUs:

      [Andrej Filipcic] 15:20:53
it depends, right? Still, quite many users are used to x86.

      [Andrej Filipcic] 15:21:00
But ARM is not so difficult in that respect, if you use the CPU-only part; when you have GPUs it will be slightly different. But ARM will definitely be a larger player in the next couple of years, something like that.

      [Enrico Fermi Institute] 15:21:14
      but my take

      [Enrico Fermi Institute] 15:21:14
But my takeaway from what you've just said is that at least the next generation is likely to have a significant CPU footprint, because they're sort of mandated to be as usable as possible to the broader scientific communities.

      [Andrej Filipcic] 15:21:35
      Right yup

      [Enrico Fermi Institute] 15:21:36
Right, okay, thanks. Okay, let's move on. Are there any other questions for Andrej?

      [Enrico Fermi Institute] 15:21:48
      I think we should move on. Hey? Thank you.

      [Andrej Filipcic] 15:21:50
You're welcome!

      [Enrico Fermi Institute] 15:21:52
Okay, we have a couple of slides from some European CMS

      [Enrico Fermi Institute] 15:21:57
efforts. Daniele isn't connected, I think; unless he's here, then he should speak up.

      [Enrico Fermi Institute] 15:22:04
He told me he couldn't. So this is

      [Enrico Fermi Institute] 15:22:08
the integration at CINECA, basically, at the CNAF Tier-1.

      [Enrico Fermi Institute] 15:22:13
So they are co-located in the same data center.

      [Enrico Fermi Institute] 15:22:18
There is the CINECA Marconi100 HPC,

      [Enrico Fermi Institute] 15:22:21
which is basically a clone of Summit in terms of system architecture.

      [Enrico Fermi Institute] 15:22:26
So it's POWER plus NVIDIA, and they integrated it as a subsite of the Tier-1.

      [Enrico Fermi Institute] 15:22:32
So since they're co-located in the same data center, they have a really fast network interconnect.

      [Enrico Fermi Institute] 15:22:37
They tie it together; the HPC can basically see the Tier-1 storage system.

      [Enrico Fermi Institute] 15:22:42
The services are provided by the data center, and they run it as a subsite of the Tier-1.

      [Enrico Fermi Institute] 15:22:52
So CMS operations only sees the Tier-1, and then internally, via some pilot customizations, they can select which parts of the workflows that are sent to the Tier-1 can run on the HPC

      [Enrico Fermi Institute] 15:23:05
side. And basically where we are today: the slide says almost complete.

      [Enrico Fermi Institute] 15:23:09
      I think it is complete now, because the announcement came out after the slide was sent to me.
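
To make the mechanism concrete, here is a minimal sketch, in Python, of the kind of selection a pilot customization could apply. The field names, thresholds, and the subsite label are invented for illustration; the real selection lives in the pilot/glideinWMS configuration, not in code like this.

```python
# Hypothetical sketch of routing work inside a Tier-1 that has an HPC subsite.
# All field names ("input_dataset", "requires_gpu", ...) are assumptions.

def eligible_for_hpc_subsite(job: dict) -> bool:
    """Keep only self-contained work on the HPC partition: no remote input
    reads and no requirements the HPC nodes cannot satisfy."""
    return (
        job.get("input_dataset") is None      # e.g. GEN-SIM with no input
        and not job.get("requires_gpu", False)
        and job.get("wall_hours", 0) <= 48    # assumed queue limit
    )

def route(job: dict) -> str:
    return "T1_HPC_subsite" if eligible_for_hpc_subsite(job) else "T1_grid"

# A generator-only request goes to the HPC; a job with input data does not.
print(route({"input_dataset": None, "wall_hours": 24}))      # T1_HPC_subsite
print(route({"input_dataset": "/store/...", "wall_hours": 24}))  # T1_grid
```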

      [Enrico Fermi Institute] 15:23:15
You see some slides on how it's integrated, and so on.

      [Enrico Fermi Institute] 15:23:18
You can see it in the monitoring: the subsite concept has some unique challenges in how you monitor it.

      [Enrico Fermi Institute] 15:23:27
      Good.

      [Paolo Calafiura (he)] 15:23:27
I'm sorry, are we looking at slides? Because we don't see the slides.

      [Enrico Fermi Institute] 15:23:31
Ah, I forgot to re-share it. Let me bring it back up. Because in the U

      [Paolo Calafiura (he)] 15:23:33
      Alright.

      [Enrico Fermi Institute] 15:23:41
S., for all the HPC sites we're using,

      [Enrico Fermi Institute] 15:23:44
we basically put the concept of a CE and a grid site on top of them, which makes the monitoring and accounting and so on really easy, because everything is reported under a unique site name. If you have a subsite, it is a little bit more difficult, because everything is kind of hidden under the umbrella of

      [Enrico Fermi Institute] 15:24:02
the Tier-1, and then you have to kind of dig into some subfields to identify it. There has been some ongoing work on the monitoring side in CMS to make that easier.

      [Enrico Fermi Institute] 15:24:16
Doesn't this model make it easier to accommodate? It makes perfect sense, I mean, for them.

      [Enrico Fermi Institute] 15:24:25
It's great, because they're co-located anyway.

      [Enrico Fermi Institute] 15:24:29
It makes perfect sense for them; it's a bit more difficult if you're geographically and organizationally separate entities.

      [Enrico Fermi Institute] 15:24:40
So in the US it's kind of difficult, because the HPCs are usually standalone. — So what is it that has changed between a few years ago and now,

      [Enrico Fermi Institute] 15:24:51
with regard to CVMFS? It seems like initially people were very, very wary of it: "we don't want to put this on our HPC,

      [Enrico Fermi Institute] 15:24:58
because it'll crash everything", or whatever. Is it that the technology has gotten better,

      [Enrico Fermi Institute] 15:25:02
or is it that people have gotten less afraid of it? — Maybe people have gotten less afraid, or just became familiar with it. Also, a lot of people are using it, not just us; that

      [Enrico Fermi Institute] 15:25:12
helps. And then, I don't worry about it anymore, because on any recent machine with a recent OS there's no problem running CVMFS access. Yeah.

      [Enrico Fermi Institute] 15:25:24
I just built my own. And, I mean, from the OSG side, when we bring on new sites, we only use

      [Enrico Fermi Institute] 15:25:31
cvmfsexec; only if they directly ask "can I please run

      [Enrico Fermi Institute] 15:25:37
CVMFS?", or if they have any other problems, do we give them the option. Why have that conversation if somebody's not pushing for it?

      [Enrico Fermi Institute] 15:25:53
Okay, even on the LCFs, no problem; it worked on Theta out of the box.

      [Enrico Fermi Institute] 15:25:58
Basically it worked on Summit out of the box too, so I don't know of any issues.

      [Enrico Fermi Institute] 15:26:01
I had to go kick the squid there so I could actually run it on the batch nodes,

      [Enrico Fermi Institute] 15:26:05
but it worked on the login node, which runs the same operating system.
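
As a concrete illustration of why this became a non-issue: a job wrapper can simply probe for the repositories and, where CVMFS is not natively mounted, fall back to the user-space cvmfsexec tool (github.com/cvmfs/cvmfsexec) mentioned above. A minimal sketch, with a hypothetical repository list and assuming the tool was unpacked into `./cvmfsexec` on the worker node:

```python
import os
import subprocess

REPOS = ["cms.cern.ch", "oasis.opensciencegrid.org"]  # hypothetical list

def repo_visible(repo: str) -> bool:
    # Listing the directory triggers the autofs mount where CVMFS is
    # natively installed, so this doubles as a probe.
    try:
        return bool(os.listdir(f"/cvmfs/{repo}"))
    except OSError:
        return False

missing = [r for r in REPOS if not repo_visible(r)]
for repo in missing:
    # cvmfsexec's mountrepo helper mounts a repo in user space (under the
    # tool's own dist directory) without requiring a system-wide install.
    subprocess.run(["./cvmfsexec/mountrepo", repo], check=True)
```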

      [Enrico Fermi Institute] 15:26:10
And then, Antonio, are you connected? Okay, so Antonio can say a few words on what we're doing at MareNostrum.

      [Antonio Perez-Calero Yzquierdo] 15:26:12
      Hi! Yes, I am! Can you hear me?

      [Antonio Perez-Calero Yzquierdo] 15:26:17
Yeah, okay. So, MareNostrum 4 is the current supercomputer,

      [Antonio Perez-Calero Yzquierdo] 15:26:26
and this is the largest HPC center in Spain.

      [Antonio Perez-Calero Yzquierdo] 15:26:29
MareNostrum 5 is actually in the procurement phase, as explained before. So we are accessing BSC and MareNostrum as a project mediated by PIC; that's the WLCG

      [Antonio Perez-Calero Yzquierdo] 15:26:48
Spanish Tier-1. And fortunately, let's say interestingly, LHC

      [Antonio Perez-Calero Yzquierdo] 15:26:54
computing has been designated as a strategic project in the BSC program.

      [Antonio Perez-Calero Yzquierdo] 15:26:58
      So this means that we basically well, we still have to request the allocation.

      [Antonio Perez-Calero Yzquierdo] 15:27:02
      But we are getting quarterly grants of about 6 or 7 million hours.

      [Antonio Perez-Calero Yzquierdo] 15:27:09
Yeah, available for CMS; and I think it's about the same amount for ATLAS.

      [Antonio Perez-Calero Yzquierdo] 15:27:16
      So we are getting these allocations. Let's say, regularly.

      [Antonio Perez-Calero Yzquierdo] 15:27:19
Okay. However, the case is very difficult for CMS.

      [Antonio Perez-Calero Yzquierdo] 15:27:24
The environment is extremely challenging because, for security reasons, no incoming or outgoing connectivity is allowed on the compute nodes.

      [Antonio Perez-Calero Yzquierdo] 15:27:36
This means that everything that needs to happen for CMS to run a job, like what I have now on the right-hand side: being connected to the workload management, being able to access the software, of course the conditions data, and finally access to storage; all these things

      [Antonio Perez-Calero Yzquierdo] 15:27:54
are cut. Basically, all this connectivity is cut off. We have recently been discussing the possibility of having some edge services,

      [Antonio Perez-Calero Yzquierdo] 15:28:07
and not even this is allowed.

      [Antonio Perez-Calero Yzquierdo] 15:28:11
So of course that's a showstopper for CMS, as tasks require

      [Antonio Perez-Calero Yzquierdo] 15:28:15
seeing services such as the ones I described. That is correct.

      [Antonio Perez-Calero Yzquierdo] 15:28:20
What we have is a login node, which allows outside access, and a shared file system mounted on the execute nodes.

      [Antonio Perez-Calero Yzquierdo] 15:28:29
And yeah, we can access this distributed file system via sshfs. So what we are doing, well, is using these capabilities to build the model that you can see on the next slide, which requires a substantial amount of integration

      [Antonio Perez-Calero Yzquierdo] 15:28:49
work. Yeah. So, the components that we have, let's say, in our favor to make this thing work: first of all there is the HTCondor split starter.

      [Antonio Perez-Calero Yzquierdo] 15:28:57
It uses the shared file system as a communication layer for the job

      [Enrico Fermi Institute] 15:29:02
      Yes.

      [Antonio Perez-Calero Yzquierdo] 15:29:02
management. Well, you can see steps A, B, C, and D

      [Antonio Perez-Calero Yzquierdo] 15:29:10
in the diagram below, where basically Condor is, kind of, well, communicating between the startd and the actual starter, where the

      [Antonio Perez-Calero Yzquierdo] 15:29:19
job runs, let's say; it is communicating by passing files through the file system.
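
A minimal sketch of that file-based handshake, assuming an invented mailbox layout. The real split starter is part of HTCondor; this only illustrates the idea of using files on a shared filesystem where sockets between the two halves are forbidden:

```python
import json
import os
import time
from pathlib import Path

# Invented location; at BSC this would live on the GPFS seen by both the
# login/bridge side and the compute nodes.
SHARED = Path(os.environ.get("MAILBOX_ROOT", "mailbox"))

def send(box: str, msg: dict) -> None:
    """Write a message atomically: temp file first, then rename."""
    d = SHARED / box
    d.mkdir(parents=True, exist_ok=True)
    tmp = d / f".{time.time_ns()}.tmp"
    tmp.write_text(json.dumps(msg))
    tmp.rename(d / f"{time.time_ns()}.msg")

def drain(box: str):
    """Read and delete pending messages; polling the FS replaces the network."""
    for f in sorted((SHARED / box).glob("*.msg")):
        yield json.loads(f.read_text())
        f.unlink()

# Bridge side: hand the worker a payload description.
send("worker-0001/in", {"cmd": "run_payload", "cfg": "step_cfg.py"})

# Worker side (inside the batch job): poll for work, report status back.
for m in drain("worker-0001/in"):
    send("worker-0001/out", {"status": "started", "echo": m})
```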

      [Antonio Perez-Calero Yzquierdo] 15:29:23
Okay. Then for software, what we do is basically replicate the CVMFS repositories at BSC: we get what we need at PIC,

      [Antonio Perez-Calero Yzquierdo] 15:29:34
and then basically send the files in and rebuild the environment at BSC.

      [Antonio Perez-Calero Yzquierdo] 15:29:40
For the conditions data: we cannot access remote databases.

      [Antonio Perez-Calero Yzquierdo] 15:29:46
We have to pre-fetch those conditions, make them into files, and pre-place them into BSC.
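
A sketch of the pre-fetch idea, assuming a generic HTTP conditions endpoint and made-up tag names. The real CMS conditions access goes through its own service; this only shows the "turn database reads into files" step:

```python
import urllib.request
from pathlib import Path

BASE_URL = "https://conditions.example.org/payload"  # hypothetical endpoint
TAGS = ["BeamSpot_v1", "Alignment_v3"]               # hypothetical tags

def prefetch(cache_dir: str = "conditions_cache") -> None:
    """Run on a connected node (e.g. at PIC); ship the cache to the HPC."""
    cache = Path(cache_dir)
    cache.mkdir(exist_ok=True)
    for tag in TAGS:
        dest = cache / f"{tag}.db"
        if dest.exists():
            continue  # already fetched
        with urllib.request.urlopen(f"{BASE_URL}/{tag}") as resp:
            dest.write_bytes(resp.read())

# Jobs on the disconnected compute nodes then read only these local files.
```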

      [Antonio Perez-Calero Yzquierdo] 15:29:51
And finally, for storage concerns, we have developed our own service for input and output data transfers; initially for output, for the stage-out, let's say, and now we are also commissioning this for stage-in.

      [Antonio Perez-Calero Yzquierdo] 15:30:08
So it's kind of quite convoluted.
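
The stage-out half of such a service can be as simple as a watcher on a connected machine: it sees the HPC shared filesystem (via the sshfs mount mentioned above) and pushes finished outputs to grid storage. A sketch with invented paths, endpoint, and marker-file convention; xrdcp is the standard XRootD copy client:

```python
import subprocess
import time
from pathlib import Path

OUTBOX = Path("/mnt/bsc_gpfs/cms/outbox")       # sshfs view of the HPC FS (invented)
DEST = "root://storage.example.es//store/hpc/"  # hypothetical grid endpoint

def stage_out_loop(poll_seconds: int = 30) -> None:
    while True:
        # Convention (invented): jobs create "<name>.done" once the
        # corresponding output file is fully written and closed.
        for marker in OUTBOX.glob("*.done"):
            output = marker.with_suffix("")  # the actual output file
            subprocess.run(["xrdcp", str(output), DEST + output.name],
                           check=True)
            output.unlink()
            marker.unlink()
        time.sleep(poll_seconds)
```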

      [Antonio Perez-Calero Yzquierdo] 15:30:12
You can see, at the two extremes of the diagram, CERN, of course, with the CMS workload management system, the storage, etc.;

      [Antonio Perez-Calero Yzquierdo] 15:30:20
and on the other hand BSC, and how we have had to build all this intermediate layer at

      [Antonio Perez-Calero Yzquierdo] 15:30:27
PIC, this bridge. Okay, next, please. Yeah. So, what's the current status?

      [Antonio Perez-Calero Yzquierdo] 15:30:35
Okay, the system works; the services and infrastructure that we have deployed

      [Antonio Perez-Calero Yzquierdo] 15:30:41
have allowed us already to run tests at a very reasonable scale:

      [Antonio Perez-Calero Yzquierdo] 15:30:47
15,000 CPU cores in MareNostrum 4.

      [Antonio Perez-Calero Yzquierdo] 15:30:51
These were realistic CMS jobs, and we reached,

      [Antonio Perez-Calero Yzquierdo] 15:30:56
well, an aggregated output rate of 500 megabytes per second.

      [Antonio Perez-Calero Yzquierdo] 15:31:00
Okay? So it's capable of sustaining such rates.

      [Enrico Fermi Institute] 15:31:03
      Yeah.

      [Antonio Perez-Calero Yzquierdo] 15:31:03
So the staging out works; it's commissioned and ready, as I'm mentioning.

      [Antonio Perez-Calero Yzquierdo] 15:31:12
The point now, let's say, okay, is actually discussing the CMS workloads that can fit into this model,

      [Antonio Perez-Calero Yzquierdo] 15:31:21
with the constraints that I explained before. So, what would I call realistic CMS workloads? So far these tests are GEN-SIM TaskChain jobs;

      [Antonio Perez-Calero Yzquierdo] 15:31:32
for example, in this case, minimum bias production. So it means there is no access

      [Antonio Perez-Calero Yzquierdo] 15:31:38
needed, or rather there is no input data. A full simulation, however, in the style that CMS mostly performs, is in the form of a StepChain.

      [Antonio Perez-Calero Yzquierdo] 15:31:49
So it's a single Condor job running all the four stages, GEN, SIM, DIGI, RECO, where in the last two stages the pileup libraries are accessed via AAA.

      [Antonio Perez-Calero Yzquierdo] 15:32:04
Okay; but we cannot have AAA there. So what we could do, in order to be able to run this full StepChain, is to copy the premix data samples into BSC.

      [Antonio Perez-Calero Yzquierdo] 15:32:15
We have, let's say, asked about this possibility,

      [Antonio Perez-Calero Yzquierdo] 15:32:19
but, okay, copying datasets of a size of about a petabyte

      [Antonio Perez-Calero Yzquierdo] 15:32:28
is not currently allowed; there's no capacity in the current MareNostrum for that. Perhaps in MareNostrum

      [Antonio Perez-Calero Yzquierdo] 15:32:35
5, but not at present. Okay, so that rules out this type of full simulation

      [Antonio Perez-Calero Yzquierdo] 15:32:41
workload, let's say. And then, what we are doing right now is commissioning the stage-

      [Antonio Perez-Calero Yzquierdo] 15:32:48
in. So, this customized data transfer service can push files from PIC storage, for example, or even files we could get through AAA into PIC, and then from there into BSC,

      [Antonio Perez-Calero Yzquierdo] 15:33:02
in order to enable running workflows which require input data.

      [Antonio Perez-Calero Yzquierdo] 15:33:05
For example, we are thinking of participating in, or enabling, data reprocessing at MareNostrum.

      [Antonio Perez-Calero Yzquierdo] 15:33:14
And this is the current situation. Now, okay,

      [Antonio Perez-Calero Yzquierdo] 15:33:19
let's say, in relation to many things that have been discussed so far

      [Antonio Perez-Calero Yzquierdo] 15:33:23
in this workshop: it's not only the capabilities that we are allowed, or actually not allowed, to have at BSC; it's also how CMS operates. For example, StepChains are preferred over TaskChains, right? So this already restricts

      [Antonio Perez-Calero Yzquierdo] 15:33:45
very much what we can do at BSC. I think that's it.

      [Enrico Fermi Institute] 15:33:53
      I just wanted to have a call

      [Enrico Fermi Institute] 15:33:55
I just wanted to make a comment. Antonio showed the split-starter method, this HTCondor integration;

      [Enrico Fermi Institute] 15:34:01
that's actually what we used for the LCF

      [Enrico Fermi Institute] 15:34:06
Theta integration, the prototype integration that we used during the 2020-21 ALCC.

      [Antonio Perez-Calero Yzquierdo] 15:34:10
      Yeah.

      [Enrico Fermi Institute] 15:34:14
It worked there, too. It's even a little simpler there, because you do have edge services that you can call out from the edge.

      [Enrico Fermi Institute] 15:34:22
So certain things are not quite as complicated as at BSC,

      [Enrico Fermi Institute] 15:34:25
but we followed the same general integration principle.

      [Antonio Perez-Calero Yzquierdo] 15:34:28
Yeah. Our case, I don't know, I would say it's particularly interesting, because we are really being asked, and forced, right?

      [Antonio Perez-Calero Yzquierdo] 15:34:37
We have been asked, and forced, to use MareNostrum by the funding agency. From the funding agency point of view,

      [Antonio Perez-Calero Yzquierdo] 15:34:44
right, I mean, we have the notion that CPU is going to be cut in further incoming requests, let's say funding requests, for our LHC computing projects. But, on the other hand, BSC is not very friendly in terms of allowing things that would

      [Antonio Perez-Calero Yzquierdo] 15:35:06
make the integration easier.

      [Enrico Fermi Institute] 15:35:07
And the funding agency has no way to influence BSC?

      [Enrico Fermi Institute] 15:35:11
They can just say no?

      [Antonio Perez-Calero Yzquierdo] 15:35:13
Yeah, it's kind of, I don't know;

      [Antonio Perez-Calero Yzquierdo] 15:35:15
I see it as kind of paradoxical, because really we're kind of trapped between the two forces squeezing us in the middle.

      [Antonio Perez-Calero Yzquierdo] 15:35:23
Right? So yeah, it's making it quite a lengthy and arduous project to integrate this.

      [Antonio Perez-Calero Yzquierdo] 15:35:32
Well, we are advancing; we are actually trying to make it as universal as possible,

      [Antonio Perez-Calero Yzquierdo] 15:35:37
let's say, in relation to CMS workflows, because otherwise we would not

      [Antonio Perez-Calero Yzquierdo] 15:35:44
be able to use the resource. But again, it's difficult.

      [Enrico Fermi Institute] 15:35:50
      Okay, Any other questions, comments.

      [Ian Fisk] 15:35:56
I had one, which is sort of to Antonio, and sort of, I think, to the larger group, which is: do we

      [Ian Fisk] 15:36:04
want to take advantage of sort of the WLCG,

      [Enrico Fermi Institute] 15:36:06
      It's

      [Ian Fisk] 15:36:09
and the sort of larger organizational structures that we have, to basically say that network connectivity on the worker nodes is necessary for us to work?

      [Ian Fisk] 15:36:21
I think it's really very impressive technical work to be able to get around this,

      [Ian Fisk] 15:36:25
but I wonder if there'd be any benefit in sort of pushing from above.

      [Antonio Perez-Calero Yzquierdo] 15:36:34
Yeah, I'm not usually involved in the political discussions,

      [Antonio Perez-Calero Yzquierdo] 15:36:40
so I couldn't tell, myself. I don't know if Simone, for example, could comment.

      [Enrico Fermi Institute] 15:36:47
I mean, the one thing, Antonio: you said that they want to reduce your funding for grid computing and replace it with HPC.

      [Enrico Fermi Institute] 15:36:55
I mean, at that point I think they expect that the HPC allocation capacity kind of counts as a replacement. Don't they need,

      [Antonio Perez-Calero Yzquierdo] 15:36:55
      Yeah.

      [Enrico Fermi Institute] 15:37:07
like, a WLCG agreement at that point, that they actually consider this to be an equivalent replacement?

      [Antonio Perez-Calero Yzquierdo] 15:37:13
Yeah; in principle the idea is that for CPU-intensive workloads, estimated at about 50% of the CPU requirement

      [Antonio Perez-Calero Yzquierdo] 15:37:24
in our request, that 50% would be provided by, yeah, by BSC,

      [Antonio Perez-Calero Yzquierdo] 15:37:30
and then we would still have some CPU for data processing,

      [Antonio Perez-Calero Yzquierdo] 15:37:34
let's say, for the usual Tier-1 work; that's kind of the idea.

      [Antonio Perez-Calero Yzquierdo] 15:37:40
But in order to do that, yeah, like I said, we are being forced to kind of transform this into a universal resource, which it very much is not.

      [simonecampana] 15:37:52
Yeah, a comment: several people, including myself, talked to the funding agency, and to PIC, and also to BSC.

      [Enrico Fermi Institute] 15:38:02
      Yeah.

      [simonecampana] 15:38:05
      But it seems to be a triangle that doesn't really understand each other.

      [simonecampana] 15:38:11
So I think what Antonio is saying is correct: they're trying to push this down

      [simonecampana] 15:38:17
our throat, and of course we are trying to push back. Now,

      [simonecampana] 15:38:21
of course, the funding agency is not obliged to pledge, right?

      [simonecampana] 15:38:27
I mean, the funding agency just says: okay, this is the money we have, and you know, if you want X, you have no choice but to use this.

      [Ian Fisk] 15:38:35
Okay; I guess Simone made that point, and my point was sort of:

      [Ian Fisk] 15:38:40
when we were writing the MoU, we set relatively strict criteria about what services needed to be run, and what the expectations were in terms of quality of service and availability, but also on the development of the protocols; and this occurs to me as a place where

      [Ian Fisk] 15:38:58
the WLCG could decide that one of the protocols that's necessary to be considered a site is this.

      [Ian Fisk] 15:39:05
And it's not guaranteed to work,

      [Ian Fisk] 15:39:07
but I think that in some sense accepting it without that almost guarantees that it will not change.

      [simonecampana] 15:39:14
Yeah, I mean, it would be useful if, I would say, the PIC management would make a formal request to WLCG on this; because in reality what PIC has done is a lot of diligent work to try to overcome the limitations.

      [Ian Fisk] 15:39:32
      Right.

      [simonecampana] 15:39:34
Hey, it would be good if this went the other way around, and at some point they would say: we cannot do this,

      [simonecampana] 15:39:41
we cannot offer Tier-1 services with this piece of the facility; and then we would have a discussion with the funding agency on that basis. At the moment those discussions have led to not too much, to be honest. I don't know if Antonio has more detail; that's what I understand also from

      [Enrico Fermi Institute] 15:39:54
      Okay.

      [simonecampana] 15:39:58
PIC.

      [Enrico Fermi Institute] 15:39:59
Could we try to move on, and maybe take that offline? It's interesting, but it's also internal to WLCG and the Spanish funding agency.

      [Antonio Perez-Calero Yzquierdo] 15:40:10
      yeah, Thank you.

      [Enrico Fermi Institute] 15:40:11
So that's not really in our scope. I think we have one more presentation, and then we still need to have the cost discussion.

      [Enrico Fermi Institute] 15:40:16
Yeah, so we're running a little late. Yes, let's move on. Taylor,

      [Enrico Fermi Institute] 15:40:24
do you have slides for us?

      [Taylor Childers] 15:40:28
      Yeah, I have a few slides

      [Enrico Fermi Institute] 15:40:30
      Okay, great.

      [Taylor Childers] 15:40:36
      Hey!

      [Enrico Fermi Institute] 15:40:40
      Right.

      [Taylor Childers] 15:40:41
So, hi. This is a disclaimer, to make sure I don't do anything silly; but, you know, the point is: this is my own outlook

      [Taylor Childers] 15:40:52
on the future. I'm not presenting any inside information; yeah, I don't even know what's coming after

      [Taylor Childers] 15:41:00
Aurora. There are people at Argonne that do, but not me.

      [Enrico Fermi Institute] 15:41:06
But Aurora is still coming, right? That's...

      [Taylor Childers] 15:41:08
Yeah; if anything is real, it's that Aurora is still coming. That's been the case for far too long.

      [Enrico Fermi Institute] 15:41:21
      still coming.

      [Taylor Childers] 15:41:22
Yeah, it's still coming. Okay. So, I went back and updated this plot from a long time ago to provide a quick update on where things are in the US.

      [Taylor Childers] 15:41:38
We've talked about this at length at this point, but I think it's also useful to look at it in the context of the LHC

      [Taylor Childers] 15:41:48
runs, right? By the time the High-Lumi LHC turns on, we're going to be dealing with machines

      [Taylor Childers] 15:41:54
      We don't even know what they look like yet, and a lot can happen between now and then that can affect how those machines look.

      [Taylor Childers] 15:42:05
So we now have Frontier deployed, so the US

      [Taylor Childers] 15:42:11
has its first exascale machine. We'll have Aurora coming online by the end of the year. And the next generation of machines, like I said, we don't know what those are; everything that we have is

      [Taylor Childers] 15:42:26
sort of Intel, NVIDIA, AMD. I would expect these to follow similar trends, amazingly, because of the politics of it

      [Taylor Childers] 15:42:37
all. I mean, we're spending US taxpayer money, and they want that to go to US corporations,

      [Taylor Childers] 15:42:44
so I expect those will stay static. But of course the variations in combinations, you can already see, are quite large, so those can still change.

      [Taylor Childers] 15:43:00
Just to quickly put that in perspective: I included the recent machine the Japanese deployed, and the European machines that have been announced. I'm pretty sure this was confirmed in Andrej's slides, or the slides on Euro-

      [Enrico Fermi Institute] 15:43:20
      Yeah.

      [Taylor Childers] 15:43:25
HPC: that there's going to be one more exascale

      [Taylor Childers] 15:43:29
machine announced. So we know JUPITER is coming, and the plan was always to have two EuroHPC exascale machines before 2025.

      [Taylor Childers] 15:43:42
I include China on here; in principle they already have three exascale machines, and intend to have 10 by 2024-25.

      [Taylor Childers] 15:43:52
      That's their goal. There's no reason they can't do that.

      [Taylor Childers] 15:43:55
They seem to be willing to burn as much coal as possible to keep these machines at the exascale.

      [Taylor Childers] 15:44:02
As I understand it, this one is just a giant...

      [Taylor Childers] 15:44:05
oh no, Tianhe-3 is a giant upgrade of Tianhe-2.

      [Taylor Childers] 15:44:09
      So it's just a bunch of cpus, and there is no energy budget there.

      [Taylor Childers] 15:44:13
      So it's you know, a hot machine. The interesting thing about all of these is that they have various architectures that are very different.

      [Taylor Childers] 15:44:28
Europe has gone heavy into ARM, and eventually will go into RISC-

      [Taylor Childers] 15:44:33
V as an open-source accelerator format.

      [Taylor Childers] 15:44:37
They're also, you know, into the sovereign

      [Taylor Childers] 15:44:42
technologies; everybody wants, you know, stuff built at home.

      [Taylor Childers] 15:44:47
So the Japanese are using Fujitsu chips,

      [Taylor Childers] 15:44:51
and the Europeans are trying to design their own. I wouldn't be surprised if the ARM and the RISC-V stuff changes in the EU, because, you know, Intel has already announced they're going to open some foundries in Europe, and I think that's kind of to help their image in the

      [Taylor Childers] 15:45:11
      area, so we'll see

      [Taylor Childers] 15:45:16
So, just a quick look at the distribution of architectures.

      [Taylor Childers] 15:45:22
So I took the Top500 and made the cutoff at machines bigger than 10 petaflops.

      [Taylor Childers] 15:45:28
That leaves me at about 50 machines, and I just plotted the flops by architecture. Frontier really heavily dominates this now, so you can see, you know, the AMD CPUs and GPUs from an exascale machine compared to everyone else.

      [Taylor Childers] 15:45:47
So you can see, right now, outside of Frontier, NVIDIA is really dominating the accelerators; there's a nice distribution of CPUs. And then I went ahead

      [Taylor Childers] 15:46:03
to '26 and tried to do the same plot
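
For reference, the aggregation behind these charts is simple enough to sketch. The column names below are guesses at the Top500 spreadsheet layout, not verified against the actual file:

```python
import csv
from collections import defaultdict

def flops_by_architecture(top500_csv: str, cutoff_pflops: float = 10.0):
    """Sum Rmax over machines above the cutoff, keyed by CPU and accelerator."""
    by_cpu = defaultdict(float)
    by_accel = defaultdict(float)
    with open(top500_csv, newline="") as f:
        for row in csv.DictReader(f):
            pflops = float(row["Rmax [TFlop/s]"]) / 1000.0  # TFlops -> PFlops
            if pflops < cutoff_pflops:
                continue
            by_cpu[row["Processor Technology"]] += pflops
            by_accel[row["Accelerator/Co-Processor"] or "none"] += pflops
    return by_cpu, by_accel

# Pie-chart inputs, e.g.:
# by_cpu, by_accel = flops_by_architecture("TOP500_202206.csv")
```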

      [Enrico Fermi Institute] 15:46:05
      Okay.

      [Taylor Childers] 15:46:10
for what I think is coming. So by 2026 the US

      [Taylor Childers] 15:46:16
and Europe will both have two exascale machines; like I said, China will have up to 10.

      [Taylor Childers] 15:46:20
I didn't include the Chinese machines in this number, largely because I have no idea of the technicalities of what they're going to be running.

      [Taylor Childers] 15:46:32
Europe has at least put out a roadmap, so their goal is to be using these ARM CPUs and the RISC-V accelerators.

      [Taylor Childers] 15:46:41
So if I include those at, sort of, you know, over an exaflop,

      [Taylor Childers] 15:46:48
then you start seeing this distribution. So you see there's ARM, AMD, Intel on the CPU side, and AMD,

      [Taylor Childers] 15:47:00
Intel on the GPU side, and then this is essentially that RISC-V processor.

      [Taylor Childers] 15:47:04
So if the Europeans decide to move to NVIDIA or Intel or AMD,

      [Taylor Childers] 15:47:11
this green blob here will shift. So you can see the variation is, you know, nearly equal.

      [Taylor Childers] 15:47:24
So then there's specialty hardware. The DOE has always been strong in partnering with industry;

      [Taylor Childers] 15:47:33
we really like pushing collaborations with industry. ALCF

      [Taylor Childers] 15:47:40
hosts the DOE AI Testbed, and currently we have five machines that are all custom silicon designed for running large machine-learning jobs. And so we've been working with those developers, testing out their software and whatnot. There's definitely an interest in identifying one or

      [Enrico Fermi Institute] 15:47:44
      Okay.

      [Taylor Childers] 15:48:04
two that, you know, scientists like best, and then moving along with maybe making those into sidecars to some future supercomputer, right?

      [Taylor Childers] 15:48:18
So you could imagine having, you know, a couple of racks of these specialized chips available to you, to run your AI much, much faster than on a traditional GPU or CPU. One moment, I'm going to close my door; my kids are coming home from

      [Taylor Childers] 15:48:43
school. The other thing I wanted to mention was, of course, AI for Science, in the context of ECP. Many of you will be familiar with ECP, the Exascale Computing Project. Yeah.

      [Enrico Fermi Institute] 15:49:01
      Cool.

      [Taylor Childers] 15:49:01
It was a large funded project on the ASCR side. The last number I heard is that, in principle,

      [Taylor Childers] 15:49:13
it funded about 1,000 FTEs across the complex, and it was all geared toward preparing for exascale machines.

      [Taylor Childers] 15:49:24
Now, with the landing of our two exascale systems, this project is going to be ramping down, and there's a lot of work to figure out what's going to come next.

      [Taylor Childers] 15:49:39
And it really looks like AI for Science is the next big push. So there are already;

      [Taylor Childers] 15:49:46
it's already been two years now of workshops

      [Taylor Childers] 15:49:50
on the ASCR side, where we are trying to lay out the groundwork for what such a project would look like, how it would be managed, and what its goals would be. So I expect that in the next, you know, five years this is going to be sort of a dominating

      [Taylor Childers] 15:50:13
force, just like ECP was. So, just something to be aware of.

      [Enrico Fermi Institute] 15:50:15
      Thank you.

      [Taylor Childers] 15:50:19
I think that's going to have a big impact on

      [Taylor Childers] 15:50:24
how our systems look, yeah, in this next round of deployments.

      [Taylor Childers] 15:50:31
So, are there any... So the takeaways, I would say: the future of architectures at HPC facilities is quite diverse,

      [Taylor Childers] 15:50:40
and I expect it to remain so. There might be some custom hardware for AI, but it will be very niche, is what I expect; and you'll just be picking up TensorFlow and PyTorch and running your software the way you would anywhere else.

      [Taylor Childers] 15:50:54
On the software implications there: using portable frameworks will be a benefit, and of course, the more we can complain and voice our interest in standard support, through the C++ standard,

      [Taylor Childers] 15:51:16
to companies, I think that, you know, is a good thing. But until everyone supports something like std::

      [Taylor Childers] 15:51:23
par out of the C++ standard, using these third-party libraries like Kokkos and SYCL and Alpaka is probably going to be the best way to go for the moment. Let's see; current exascale machines

      [Taylor Childers] 15:51:38
were largely decided before AI became a real focus

      [Taylor Childers] 15:51:43
in DOE science, and I expect that to be a bigger driver for the next round of systems that are coming. And again, of course, the energy budgets and competitive nature of these machines will probably drive them in the direction of accelerators, but things

      [Taylor Childers] 15:52:07
shift quickly; it's hard to predict. So yeah, that's where I leave that.

      [Enrico Fermi Institute] 15:52:19
Taylor, I had a quick question; I think it's on slide 3, where you made the pie charts.

      [Enrico Fermi Institute] 15:52:26
Yeah. If you would try to make a single pie chart, right?

      [Enrico Fermi Institute] 15:52:33
The problem with pie charts is you can't tell the relative sizes: how much larger are the GPU flops currently versus the CPU flops?

      [Enrico Fermi Institute] 15:52:42
Is there a way to get it all into a single one?

      [Taylor Childers] 15:52:48
Yeah, I mean, any system that has accelerators is dominated by them, right? The last time I calculated that, it was probably for Summit, and it was, you know, on the level of 5 to 10 times the CPU flops,

      [Enrico Fermi Institute] 15:52:54
      Yeah.

      [Taylor Childers] 15:53:06
and it got even worse when I did the calculation for Frontier and Aurora.

      [Taylor Childers] 15:53:13
      But it's been a long time since I looked at those

      [Enrico Fermi Institute] 15:53:17
So I guess the point is, if it were drawn to scale, the GPU pie chart would be 10 times larger than the CPU one, or 5 to 10 times; not the same size, right?

      [Taylor Childers] 15:53:23
      That's right.

      [Taylor Childers] 15:53:29
      For sure, for sure.
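
One way to answer the "single pie chart" request is to put CPU and GPU flops in the same pie so the relative areas are to scale. A minimal sketch; the numbers are placeholders illustrating the roughly 10x accelerator dominance mentioned above, not measured values:

```python
import matplotlib.pyplot as plt

shares = {
    "GPU flops (all vendors)": 10.0,  # assumed ~10x, per the Summit estimate
    "CPU flops (all vendors)": 1.0,
}
plt.pie(shares.values(), labels=shares.keys(), autopct="%1.0f%%")
plt.title("CPU vs GPU share of total flops (illustrative)")
plt.savefig("cpu_vs_gpu_share.png")
```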

      [Enrico Fermi Institute] 15:53:32
And what is "other" among the GPUs here?

      [Taylor Childers] 15:53:36
      So

      [Enrico Fermi Institute] 15:53:38
      Is that the

      [Taylor Childers] 15:53:40
Yeah; so that would be, in this case, the Fujitsu chip.

      [Enrico Fermi Institute] 15:53:47
      Okay.

      [Taylor Childers] 15:53:50
      I can look back in my spreadsheet, too.

      [Enrico Fermi Institute] 15:54:01
That probably also explains why that slice is a larger piece than the other.

      [Taylor Childers] 15:54:06
Oh no, sorry; in this one, "other" is the Tianhe-2, which is on the Top500. It's this one.

      [Enrico Fermi Institute] 15:54:13
      Okay.

      [Enrico Fermi Institute] 15:54:21
Okay; if you told me that was a 386 chip, I'd also believe you.

      [Enrico Fermi Institute] 15:54:26
So, okay. Taylor, on performance portability: does that mean that if DOE decides on a system design, the LCF,

      [Enrico Fermi Institute] 15:54:39
or whoever funds it, makes sure that it's supported by the performance

      [Enrico Fermi Institute] 15:54:44
portability libraries?

      [Taylor Childers] 15:54:46
Well, I think that's the benefit of something like

      [Taylor Childers] 15:54:51
Kokkos, which is really third-party support, right?

      [Taylor Childers] 15:54:55
Kokkos came out of the ECP project, and I imagine it will continue to be supported.

      [Taylor Childers] 15:55:06
And since it's third party, they can just come in and write a new plugin for whatever, you know, new hardware comes along; and so, as long as you use it, you gain the benefit from that. When we first were working with Intel and SYCL, I was

      [Taylor Childers] 15:55:31
very skeptical of SYCL. I mean, in general

      [Taylor Childers] 15:55:35
I'm skeptical of, especially, telling scientists to invest their time in a solution that's being pushed by one of the manufacturers.

      [Taylor Childers] 15:55:47
Right? I mean, CUDA is a mess; as, you know, someone who came up in the sciences writing code,

      [Taylor Childers] 15:55:55
I would never wish anyone to write code in CUDA, and so I approached SYCL with the same suspicion.

      [Taylor Childers] 15:56:07
But I mean, it's getting good performance, and it allows you to write your code once; and so far we've been able to run it on all three systems.

      [Taylor Childers] 15:56:17
At least with MadGraph we have a SYCL implementation, and it runs on the AMD, the Intel, and the NVIDIA GPUs without any problem, and does very well; and Kokkos is the same. And like you said, the nice thing about those two is that you write

      [Enrico Fermi Institute] 15:56:32
      See.

      [Taylor Childers] 15:56:37
your code once. But the CUDA implementation of MadGraph right now is riddled with preprocessor ifdefs everywhere, because if you're not on a CUDA device you need to run the C++ code, and, you know, it just becomes really hard to

      [Taylor Childers] 15:56:56
maintain for someone who's not a dedicated software engineer.

      [Enrico Fermi Institute] 15:57:07
We still have to cover the HPC costs; I would like to at least attempt to go through those slides.

      [Enrico Fermi Institute] 15:57:14
Okay; if it takes too long, eventually we might have to cut it off and move it to tomorrow or something.

      [Enrico Fermi Institute] 15:57:18
Yeah, we could start a little earlier tomorrow; I don't know how people feel about that.

      [Enrico Fermi Institute] 15:57:24
Yeah, thanks, Taylor, appreciate it. So let's try to go to the HPC

      [Enrico Fermi Institute] 15:57:32
costs, and then we're right up against the time. Yeah.

      [Enrico Fermi Institute] 15:57:35
There was a question in the charge; let me remember to share.

      [Enrico Fermi Institute] 15:57:41
It asks about the total cost of operating HPC resources, especially including the outlook. And the thing is, the cost of operating it; I mean, this is really about acquiring and operating, because nominally they're free. I mean,

      [Enrico Fermi Institute] 15:58:02
      eventually there's some indirect effect, because you get them from the same funding agencies.

      [Enrico Fermi Institute] 15:58:07
that fund your hardware purchases; but that's indirect, and that's also outside the scope of this workshop.

      [Enrico Fermi Institute] 15:58:14
So you basically have to prepare your proposals once per year, usually; ACCESS allows supplementals.

      [Enrico Fermi Institute] 15:58:22
There's work on multi-year proposals, and maybe that will mean that you still have to do a proposal each year,

      [Enrico Fermi Institute] 15:58:30
      But you don't have to do much work for it.

      [Enrico Fermi Institute] 15:58:31
You just sign it off with your request; you already know what you're getting. But this is a work in progress. And then there's technical integration and commissioning work, and that's mostly one-time:

      [Enrico Fermi Institute] 15:58:43
you integrate a facility once, you find a way to make it work, and then you just have to maintain what you came up with; and this needs to be redone every few years,

      [Enrico Fermi Institute] 15:58:56
because these HPCs have a limited lifetime;

      [Enrico Fermi Institute] 15:58:58
basically, five years is around the maximum before they replace it with a different machine.

      [Enrico Fermi Institute] 15:59:03
What we've experienced so far is that there are synergy effects

      [Enrico Fermi Institute] 15:59:07
if you stay within the same facility, because usually they have similar restrictions and similar ways to do things. So, switching from one cluster to another in the same facility when they do a replacement, you don't have to throw out everything and start from scratch; you just make adjustments to what

      [Enrico Fermi Institute] 15:59:27
you did before. There's an open question on the LCF

      [Enrico Fermi Institute] 15:59:34
integration, at least on the CMS side; I mean, ATLAS has Harvester. For us, at least, the long-term operational overheads

      [Enrico Fermi Institute] 15:59:42
there are a little harder to estimate. They're likely also larger there, because the provisioning integration looks like it's going to be a bit more complex, and doesn't tie neatly into what we're doing anyway

      [Enrico Fermi Institute] 15:59:57
for the grid sites; so you need to do something special. Then, support:

      [Enrico Fermi Institute] 16:00:02
      I mean, that's one of the things that came up in the context of pledging.

      [Enrico Fermi Institute] 16:00:07
it's something where you need to be able to send a ticket.

      [Enrico Fermi Institute] 16:00:11
So there's operations support, because you now need a CMS site contact.

      [Enrico Fermi Institute] 16:00:16
Now, admittedly, at the grid sites, the T2s,

      [Enrico Fermi Institute] 16:00:19
the site contact is also usually someone the operations program pays for.

      [Steven Timm] 16:00:23
      hmm.

      [Enrico Fermi Institute] 16:00:24
It's not that this is necessarily a cost that's unique to the HPCs.

      [Steven Timm] 16:00:29
      well, that

      [Steven Timm] 16:00:30
Well, I mean, if there's a problem at an HPC now, the HEPCloud team gets a GGUS ticket, and we respond to it.

      [Enrico Fermi Institute] 16:00:31
      Yes.

      [Enrico Fermi Institute] 16:00:38
Yes, exactly, that's what I mean. I mean, at a T-

      [Steven Timm] 16:00:40
So HEPCloud here is the same kind of contact.

      [Enrico Fermi Institute] 16:00:42
2: if there's a problem at Wisconsin, you file a ticket, and the person that we pay money, or funds, to from the operations program

      [Steven Timm] 16:00:51
      Okay.

      [Enrico Fermi Institute] 16:00:53
at Wisconsin responds to it. So in that sense it's not that different from supporting grid site operations. And again, the other great example is that the European grid folks use experiment-specific ops

      [Steven Timm] 16:00:55
      Good.

      [Enrico Fermi Institute] 16:01:09
teams, or even WLCG-specific ops teams, that can be fairly far separated from

      [Steven Timm] 16:01:16
      Yes.

      [Enrico Fermi Institute] 16:01:19
the people who are actually operating the cluster.

      [Steven Timm] 16:01:20
      Yeah.

      [Enrico Fermi Institute] 16:01:21
Yeah; and then I want to break that operations support into two components.

      [Enrico Fermi Institute] 16:01:27
Because one is just normal workflow support, just dealing with: oh, you have a lot of failures,

      [Enrico Fermi Institute] 16:01:33
can you look into it? And you look at log files, or whatever; usually debugging of job failures. To first order this scales with the amount of resources, because the more work you pass through, the more problems you can expect. And there's overlap here with the normal

      [Enrico Fermi Institute] 16:01:50
operations support by the experiment, the first line of defense that basically monitors overall workflow and computing operations,

      [Enrico Fermi Institute] 16:01:59
up to the point where you open the GGUS ticket against the site. And then the second component:

      [Enrico Fermi Institute] 16:02:07
once that GGUS ticket is open,

      [Enrico Fermi Institute] 16:02:09
whoever responds will have to have specialized HPC integration knowledge, because some of these failure modes can be specific to how that HPC

      [Enrico Fermi Institute] 16:02:20
was integrated. And that implies that there's a long-term need to keep commissioning expertise around.

      [Enrico Fermi Institute] 16:02:28
But we probably need to do that anyway, because of the HPC

      [Enrico Fermi Institute] 16:02:35
cluster turnover; the commissioning efforts need to be redone.

      [Enrico Fermi Institute] 16:02:40
So if you're talking many HPCs, there's constantly a need to work on this stuff. — We've been doing this long enough;

      [Enrico Fermi Institute] 16:02:48
can't you estimate what those labor costs are,

      [Enrico Fermi Institute] 16:02:52
in FTEs? — Yeah, you can try to come up with one;

      [Steven Timm] 16:02:54
      Right.

      [Enrico Fermi Institute] 16:02:55
I mean, we've done it for multiple years. For the user facilities

      [Steven Timm] 16:02:57
      Oh!

      [Enrico Fermi Institute] 16:03:00
you definitely can do it. For the LCFs, as I said, I'm unsure, because I don't know what the long-term stable operations

      [Enrico Fermi Institute] 16:03:08
mode will look like; at the moment that still needs to be worked out.

      [Enrico Fermi Institute] 16:03:11
But for the user facilities we can definitely come up with an estimate. — And then for the LCFs,

      [Steven Timm] 16:03:14
      Right. I mean

      [Enrico Fermi Institute] 16:03:17
can you write down why you can't get what you need from them, so that in the document you can make an estimate,

      [Enrico Fermi Institute] 16:03:25
but you can qualify it? No, no; what I mean is: you can do it for the user facilities, right?

      [Steven Timm] 16:03:27
      Right.

      [Enrico Fermi Institute] 16:03:30
And then, because they have these properties; at the LCFs you can't.

      [Steven Timm] 16:03:34
      Right.

      [Enrico Fermi Institute] 16:03:35
You can put some error bars, but they're missing these properties.

      [Enrico Fermi Institute] 16:03:39
If they had those properties that the user facilities have, would that allow you to give a more precise estimate for the LCFs?

      [Enrico Fermi Institute] 16:03:45
You see what I'm saying? Obviously, something about the way the user facilities are set up.

      [Steven Timm] 16:03:45
      Okay, Well.

      [Enrico Fermi Institute] 16:03:51
Steve? Go ahead, Steve.

      [Steven Timm] 16:03:52
Yes. So you have two components to the maintenance.

      [Steven Timm] 16:03:59
One of them is when the remote site changes their API,

      [Steven Timm] 16:04:03
the way you have to log in. Okay, that's been done four times in six years

      [Steven Timm] 16:04:07
now; breaking the interface that we used, and having to change it.

      [Steven Timm] 16:04:13
So that's one end of things. So, I mean, this is fairly straightforward;

      [Steven Timm] 16:04:19
I mean, this is the sort of thing you should expect to change. The other part of it is stuff upstream of us, for instance token authorization.

      [Steven Timm] 16:04:31
I mean, there we still haven't quite got done; all the various hacks that are done to get into the HPC

      [Steven Timm] 16:04:40
sites don't necessarily translate as well as a regular site would, and more work needs to be done

      [Steven Timm] 16:04:43
there. So if you have a big change in the upstream, OSG or things like that, that can really throw us for a loop.

      [Enrico Fermi Institute] 16:04:53
That's what I meant by technical integration and commissioning work:

      [Enrico Fermi Institute] 16:04:56
that there's a long-term maintenance effort.

      [Steven Timm] 16:04:56
      Alright.

      [Steven Timm] 16:04:59
      Well, it

      [Enrico Fermi Institute] 16:04:59
These setups are always a bit special, so there's always the chance that something will break and you have to redo it.

      [Steven Timm] 16:05:05
Right. You need somebody that can read and understand factory logs, basically,

      [Steven Timm] 16:05:08
and be on call. Got it.

      [Enrico Fermi Institute] 16:05:11
And the maintenance isn't necessarily evenly distributed;

      [Enrico Fermi Institute] 16:05:15
it's not a fixed-amount-per-month type thing, right? Sometimes for six months nothing happens, and then, like, something goes boom.

      [Steven Timm] 16:05:17
Right, right. Then you have to allow for the fact that some of these people don't answer their tickets very well at all.

      [Steven Timm] 16:05:28
Yeah, in particular. So if anybody's got a way to get people to listen to them,

      [Steven Timm] 16:05:38
we'd like to hear it, because we have very little luck.

      [Enrico Fermi Institute] 16:05:44
      And

      [Steven Timm] 16:05:45
      okay.

      [Enrico Fermi Institute] 16:05:47
Okay. But I think we can make an attempt here to estimate this in terms

      [Steven Timm] 16:05:52
      Yeah, yeah, yeah, sure.

      [Enrico Fermi Institute] 16:05:52
of FTEs; we can probably base it on existing estimates. We have them for the grid sites, the T2 sites, which is also a good reference.

      [Steven Timm] 16:05:59
      Well.

      [Steven Timm] 16:06:02
So the amount of effort there to help with the maintenance is well known.

      [Enrico Fermi Institute] 16:06:07
      Yeah, but I also

      [Steven Timm] 16:06:09
And so, basically, 30% of me; basically, that's what it is.

      [Steven Timm] 16:06:14
      So

      [Enrico Fermi Institute] 16:06:15
So, but all FTEs are not created equal; somehow you have to capture the skill set that the FTE

      [Steven Timm] 16:06:18
      Good.

      [Enrico Fermi Institute] 16:06:22
needs. Yeah; that's harder to do in terms of a high-level document. — I know it's harder, but you have to.

      [Enrico Fermi Institute] 16:06:35
Good. Well, yeah: ATLAS and CMS have solved the same problem

      [Enrico Fermi Institute] 16:06:40
in two slightly different ways, and that requires two different skill sets, political and technical.

      [Enrico Fermi Institute] 16:06:47
The one that I really think we should hammer on is the difference of these costs

      [Enrico Fermi Institute] 16:06:54
for an LCF-type facility versus a user facility. So I think you could probably communicate that more effectively.

      [Enrico Fermi Institute] 16:07:03
That's probably... that might be the way to order it. Sure.

      [Steven Timm] 16:07:04
Oh, I mean, there's ongoing dev work, and there's going to be ongoing dev work on the LCF side, too;

      [Steven Timm] 16:07:11
I mean, significant dev work there.

      [Enrico Fermi Institute] 16:07:12
Yeah, but that's a one-time cost.

      [Enrico Fermi Institute] 16:07:14
We also want to try to estimate what the long-term operational support is, and there will be large error bars,

      [Enrico Fermi Institute] 16:07:22
but we can make an attempt.

      [Steven Timm] 16:07:23
      Right.

      [Enrico Fermi Institute] 16:07:26
And then, apart from the costs and efforts that are directly associated with HPC operations,

      [Enrico Fermi Institute] 16:07:36
there's a secondary component that's a bit more indirect and harder to estimate, but it will come into play at some point as we scale up HPC operations: we need hardware and services at grid sites to support the data and job flows at the

      [Enrico Fermi Institute] 16:07:51
HPCs.

      [Enrico Fermi Institute] 16:07:53
Because you didn't put it in as a cost: the payload cost.

      [Enrico Fermi Institute] 16:07:58
So, in other words, as we just heard, in Europe and in the US

      [Enrico Fermi Institute] 16:08:03
the next-generation big machines will have more and more accelerators; that's where the flops are, you know.

      [Enrico Fermi Institute] 16:08:12
Nominally they will have a CPU-only part, and the cost of porting things to GPU was specifically excluded as out of scope for this document, I understand; but we have to explain that that is something that will probably have to be handled. Because, you know, obviously CMS, since GPUs are in your

      [Enrico Fermi Institute] 16:08:32
trigger, you guys are a little bit farther ahead than ATLAS.

      [Enrico Fermi Institute] 16:08:36
I mean, we will put that in as a component, but we're not going to put any effort level on it, because you don't know. And it's not the goal of this document; it's not supposed to be its goal.

      [Enrico Fermi Institute] 16:08:49
Another strategic thing you could talk about here is what's common versus

      [Enrico Fermi Institute] 16:08:59
what's experiment-specific. — Yeah, keeping it at the leading-order type things.

      [Enrico Fermi Institute] 16:09:08
If we go through the presentations and find overlaps, then call them out; because again, when it comes to cost, you need to think about how the agencies view it.

      [Enrico Fermi Institute] 16:09:23
They do like to see common activities.

      [Enrico Fermi Institute] 16:09:30
You can't make things that are common, not common.

      [Enrico Fermi Institute] 16:09:33
So it would be death to say everything is the same;

      [Enrico Fermi Institute] 16:09:43
but trying to call that out can be a strategic way to help people look at the cost.

      [Enrico Fermi Institute] 16:09:54
Steve, I see your hand is still up; did you have another comment?

      [Steven Timm] 16:10:00
No, I'm good, no.

      [Enrico Fermi Institute] 16:10:02
      Alright on that last bullet. Oh, no!

      [Enrico Fermi Institute] 16:10:10
      This is us.

      [Enrico Fermi Institute] 16:10:23
When you get to the report writing — I mean, if I had a better way to state it... it doesn't have to be... So what I would highlight:

      [Enrico Fermi Institute] 16:10:33
this doesn't have to be at grid sites. For example, if you think of the Spin work at NERSC, that might be perfectly fine.

      [Enrico Fermi Institute] 16:10:43
So is it not really about edge services? No —

      [Enrico Fermi Institute] 16:10:51
because, for instance, you wouldn't need Globus and all that if NERSC could be an equal member of the WLCG data grid; you would not have to do any sort of translation or jump-through-hoops step. If

      [Enrico Fermi Institute] 16:11:11
ALCF had a gatekeeper, or something equivalent, that we could

      [Enrico Fermi Institute] 16:11:18
both submit jobs to with tokens —

      [Enrico Fermi Institute] 16:11:22
that's an example of an edge service that would be common development.

      [Enrico Fermi Institute] 16:11:25
That would make the cost easier. But I'd include that more in the technical integration and long-term maintenance; the stuff that has to happen at the HPC sites I would include there.

      [Enrico Fermi Institute] 16:11:41
My problem with that last bullet is: it says that having services at grid sites is the solution.

      [Enrico Fermi Institute] 16:11:51
You could turn that bullet into additional operated services for HPC,

      [Enrico Fermi Institute] 16:11:57
as opposed to saying services at grid sites. But that is a dollar cost.

      [Enrico Fermi Institute] 16:12:03
That money was spent. Yeah, and it was to work around the deficiency.

      [Enrico Fermi Institute] 16:12:09
But the point is, does that not fall under the prior two bullets?

      [Enrico Fermi Institute] 16:12:19
What I thought to include here — we'll have a discussion on that later, because there are some integration hypotheticals and impact on the rest of the collaboration.

      [Enrico Fermi Institute] 16:12:30
It's more about: assume Fermilab is a big storage site for CMS in the US,

      [Enrico Fermi Institute] 16:12:36
and consider the difference between putting 50,000 extra CPUs

      [Enrico Fermi Institute] 16:12:41
at Fermilab, and having those 50,000 CPUs somewhere else.

      [Enrico Fermi Institute] 16:12:46
That's network, and external data serving, and transport links.

      [Enrico Fermi Institute] 16:12:51
Okay. So especially in terms of capital equipment, I mean —

      [Enrico Fermi Institute] 16:12:56
so what we could do is call out service operations — the services support cost — separately from operations support.

      [Enrico Fermi Institute] 16:13:05
But if you're really thinking hardware, call hardware out separately. That's a very different color of money.

      [Enrico Fermi Institute] 16:13:15
That's hardware. The last bullet is hardware.

      [Enrico Fermi Institute] 16:13:18
I can tell you how much we spend. Yeah, in that case don't mix it in with services.

      [Enrico Fermi Institute] 16:13:27
Have a hardware-only bullet, right?

      [Enrico Fermi Institute] 16:13:32
And that hardware potentially needs renewal, right?

      [Enrico Fermi Institute] 16:13:36
Of course. What I mean is, if we continue to need it, we have to continue to fund it. So I would just split that last one into at least two bullets.

      [Enrico Fermi Institute] 16:13:47
Yes, okay. I think that was the last slide we had for today.

      [Enrico Fermi Institute] 16:13:53
Are you thinking, at the end, for the strategic report

      [Enrico Fermi Institute] 16:13:57
in December or whatever, to have a dollar range here? Is that the intent, or just pointing out the considerations that need to be made?

      [Enrico Fermi Institute] 16:14:10
We were specifically discouraged from comparing HPC and

      [Enrico Fermi Institute] 16:14:16
cloud costs to grid costs. There was a little bit of back and forth, but at the end that's the decision that was made.

      [Enrico Fermi Institute] 16:14:24
So we should just try to come up with some costs on their own,

      [Enrico Fermi Institute] 16:14:29
without comparison. But I mean, are you saying for a user facility, like NERSC,

      [Enrico Fermi Institute] 16:14:34
      We need between x

      [Enrico Fermi Institute] 16:14:40
We'll put in an FTE number, different depending on where.

      [Enrico Fermi Institute] 16:14:51
Should storage costs also be folded in?

      [Enrico Fermi Institute] 16:14:54
X amount of CPU cores, running efficiently, means

      [Enrico Fermi Institute] 16:14:59
Y amount of disk at the site; so if we can't get the Y

      [Enrico Fermi Institute] 16:15:05
amount of disk through the grant procedure, then that would actually be a cost, because you would have to do the condo model of buying storage. Well, that's why I like just having a separate hardware bullet. Obviously you care where the

      [Enrico Fermi Institute] 16:15:26
hardware sits, but there will be a capital outlay.

      [Enrico Fermi Institute] 16:15:32
This last part ties into the discussion this morning about data delivery, and having a significant cache or data-serving endpoint at the HPCs,

      [Enrico Fermi Institute] 16:15:45
      If you wanted to do it that way. I don't mean to.

      [Enrico Fermi Institute] 16:15:49
      I guess the idea is that that would come through an allocation if it's part of the facility, right?

      [Enrico Fermi Institute] 16:15:53
So maybe that's a different point. If they give us storage, then it comes through the allocation. Yeah.

      [Enrico Fermi Institute] 16:15:58
But if we get very little storage, that puts a lot of pressure on the network, and then on storage somewhere else.

      [Enrico Fermi Institute] 16:16:06
You can think of it this way: I get 500 terabytes with my allocation, but I need a petabyte. How do I make up the needs gap? Either I make it up through streaming in

      [Enrico Fermi Institute] 16:16:18
and out, or I buy storage at the site, and so on.
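To make that streaming-versus-buying trade-off concrete, here is a back-of-the-envelope sketch in Python. Only the 500 TB and 1 PB figures come from the example above; the reuse rate and cost per TB are invented placeholders, not workshop numbers:

```python
# Back-of-the-envelope sizing for the allocation-storage gap discussed above.
# Only the 500 TB / 1 PB figures come from the example; the rest are
# illustrative assumptions.

allocated_tb = 500        # storage granted with the allocation (TB)
needed_tb = 1000          # storage the workflow actually needs (TB)
gap_tb = needed_tb - allocated_tb

# Option A: cover the gap by streaming the missing data over the WAN.
reuse_per_month = 2       # assumed: each missing byte is re-read twice a month
monthly_wan_tb = gap_tb * reuse_per_month
seconds_per_month = 30 * 24 * 3600
wan_gbps = monthly_wan_tb * 8000 / seconds_per_month   # 1 TB = 8000 Gbit

# Option B: cover the gap with "condo" storage bought at the site.
cost_per_tb_year = 20     # assumed $/TB/year, purely illustrative
condo_cost_per_year = gap_tb * cost_per_tb_year

print(f"gap to cover:   {gap_tb} TB")
print(f"streaming:      ~{wan_gbps:.1f} Gbit/s sustained WAN traffic")
print(f"condo storage:  ~${condo_cost_per_year:,} per year")
```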

      [Enrico Fermi Institute] 16:16:28
It depends on how much time you have: you can talk about the different types of costs and different example scenarios. Because the problem with

      [Enrico Fermi Institute] 16:16:36
these things about caches, or storage at the site — it's a trade-off. You can say,

      [Enrico Fermi Institute] 16:16:42
well, if I put 200 TB at the site, I might save this much external traffic over the years.

      [Enrico Fermi Institute] 16:16:48
But then obviously at some sites, no. Or I can find a quote for what it takes to own that — Expanse as an example — but usually...

      [Enrico Fermi Institute] 16:17:04
Well, that's the problem: what is "usually"? I can tell you where you can do this.

      [Enrico Fermi Institute] 16:17:10
I can tell you that NERSC allows you to buy in — give them the money and they do it — and some of the smaller sites too.

      [Enrico Fermi Institute] 16:17:17
That's in fact how the HEP group got into the LCRC.

      [Enrico Fermi Institute] 16:17:22
They have a condo model — you buy the hardware, they'll deploy it.

      [Enrico Fermi Institute] 16:17:27
There's a caveat, though, because storage is like a multi-year

      [Enrico Fermi Institute] 16:17:31
commitment. Or do you pay for it — do you rent it?

      [Enrico Fermi Institute] 16:17:35
It basically depends. It's usually for a quantum of time, which may be multi-year, but at the end of the quantum it goes away. Maybe write up a couple of scenarios, to lay out the fact that some of these are trade-offs and to communicate that.

      [Enrico Fermi Institute] 16:17:57
But we'd prefer that it comes through the allocation process, because in the application we lay out a use case, and we say we can use this much CPU — but then we need that much storage to actually use it effectively.

      [Enrico Fermi Institute] 16:18:10
      So this would be a

      [Enrico Fermi Institute] 16:18:13
Having to buy storage would not be the preferred choice. It gets into how much time you want to spend drawing up these scenarios.

      [Enrico Fermi Institute] 16:18:21
There's a lot to write here. The HPC facilities typically haven't had in their architecture something sitting there that looks like a cache facing the wide-area network.

      [Enrico Fermi Institute] 16:18:33
In other words, they have different ways of provisioning storage internally.

      [Enrico Fermi Institute] 16:18:40
But usually, like we saw at NERSC: there's a big scratch disk, there's other storage — the home file system, the big scratch area — but there didn't seem to be something sitting on the edge of

      [Enrico Fermi Institute] 16:18:53
the network that could actually serve as a cache.

      [Enrico Fermi Institute] 16:18:59
I mean, the file systems are connected to the data transfer nodes, to the outside, and that's a separate connection.

      [Enrico Fermi Institute] 16:19:05
It's not internal, but it's usually high speed, so you can get in and out of there.

      [Enrico Fermi Institute] 16:19:11
It's not visible on the outside, though. — What's your budget?

      [Enrico Fermi Institute] 16:19:17
      I think it's what Doug was saying

      [Enrico Fermi Institute] 16:19:21
Buy five more switches and some memory for the cache — so we'll say yes.

      [Enrico Fermi Institute] 16:19:30
Okay, any other comments from the Zoom?

      [Enrico Fermi Institute] 16:19:38
      I think we're done. Thanks, everybody for slogging it out.

      [Enrico Fermi Institute] 16:19:43
Yeah. So I think that's good, because we'll come back to HPC in some of the later discussions.

      [Enrico Fermi Institute] 16:19:50
The focus tomorrow morning — yes, we'll start with the cloud focus area tomorrow, and then in the afternoon we'll have networks, integration hypotheticals, and R

      [Enrico Fermi Institute] 16:20:04
and D. Okay, good. Thanks, everybody. We'll talk to you tomorrow.

      [Antonio Perez-Calero Yzquierdo] 16:20:09
      Thank you.

      • 13:00
        Current Landscape and Use 20m

        [Enrico Fermi Institute] 14:00:40
We're just getting back into the room here and getting started again.

        [Enrico Fermi Institute] 14:00:44
So now we're starting the HPC focus area.

        [Enrico Fermi Institute] 14:00:49
Yeah, thanks. So we can jump right into it here.

        [Enrico Fermi Institute] 14:00:55
Okay, I see the people are rejoining. Yeah, so this afternoon we have the HPC focus area.

        [Enrico Fermi Institute] 14:01:04
We already did quite a bit of discussion, but the hope is that we go a little deeper on certain topics, and we also have some questions and points for discussion that weren't brought up yet. So this is just a redo, maybe a little bit deeper than the

        [Enrico Fermi Institute] 14:01:22
introduction slide, of what we're targeting, and the separation of the user-focused facilities and the LCFs. Maybe one thing here on the user-focused facilities that hasn't been discussed a lot is where this is going for the NSF-funded HPCs: whether they stay

        [Enrico Fermi Institute] 14:01:50
CPU-only, or whether they will also follow the transition to GPU, because so far they pretty much follow their users.

        [Enrico Fermi Institute] 14:02:02
They have a few GPUs on the side for training and testing, but it's usually not the bulk of the facility. And NERSC has made that switch with the transition from Cori to Perlmutter.

        [Enrico Fermi Institute] 14:02:15
So do we have to worry about the same switch happening at the NSF facilities at some point? They have the same power constraints — probably not, because they're smaller facilities, but they're also getting larger.

        [Enrico Fermi Institute] 14:02:37
Right. Do you have any input on that question? — Which question? — What about the next generation of NSF-funded HPC: do we have to worry about them making the transition

        [Enrico Fermi Institute] 14:02:46
to GPU, or do they stay on CPU and follow their users? — There's always going to be a big honking CPU machine,

        [Enrico Fermi Institute] 14:02:56
so I don't think Anvil or Expanse or any of their successors...

        [Enrico Fermi Institute] 14:03:06
Okay, but past that, it comes down to a question like: do you believe what NSF

        [Enrico Fermi Institute] 14:03:14
has been authorized by Congress, or do you believe what they've been appropriated by Congress?

        [Enrico Fermi Institute] 14:03:19
So some of the big expansions, you know, would allow a leadership-class facility on the NSF side, and that would be, for a lot of the same reasons as on the DOE

        [Enrico Fermi Institute] 14:03:39
side, very heavy on the accelerator end. So if you believe that that's done and it's going to happen, then yeah, there's going to be a big honking, heavy-GPU machine.

        [Enrico Fermi Institute] 14:03:49
        But I I don't think that that's going to be.

        [Enrico Fermi Institute] 14:03:54
In addition to the other types of resources they always have. I mean, the big machine that they have right now is Frontera.

        [Enrico Fermi Institute] 14:04:02
That's all CPU. It's very, very...

        [Steven Timm] 14:04:04
Right. If you look at their website — yeah, if you look at the TACC website, there is also something about a leadership-class facility machine coming. I don't

        [Enrico Fermi Institute] 14:04:06
It's not a leadership-class...

        [Steven Timm] 14:04:19
think they say when it's coming, but they say it's coming.

        [Enrico Fermi Institute] 14:04:22
Yeah. So they've gotten authorization to do design studies,

        [Enrico Fermi Institute] 14:04:26
and they're doing all the kind of information gathering to do such a thing. But at some point somebody has to come up with a slug of money. And if you believe what Congress has authorized, the NSF has a sufficient slug of money, because their total budget goes up

        [Enrico Fermi Institute] 14:04:45
by 20%. But Congress, at least in 2022, has not actually given them the money.

        [Enrico Fermi Institute] 14:04:54
So that's where it gets into crystal-ball territory; you can lose your whole afternoon trying to guess what funding agencies are going to do.

        [Enrico Fermi Institute] 14:05:01
So I wouldn't suggest doing that. But again, the short version is: I personally believe there are always going to be some sort of heavy CPU resources, because they are wildly popular within NSF. There are going to be GPU resources too —

        [Enrico Fermi Institute] 14:05:18
all the GPUs that are... I guess you have

        [Enrico Fermi Institute] 14:05:21
Bridges-2 too, but it's going to be very balanced, based on the user

        [Enrico Fermi Institute] 14:05:26
community. Yeah, the thing that might change, be different, or grow is whether or not you believe this TACC leadership facility is coming.

        [Steven Timm] 14:05:34
        good.

        [Enrico Fermi Institute] 14:05:35
Okay. So — one question, though.

        [Ian Fisk] 14:05:36
        oh!

        [Ian Fisk] 14:05:41
I wanted to mention a couple of things. Expanse is not that big — Expanse is 90,000 cores, which makes it like a tenth of the WLCG. It's far from a leadership-class machine.

        [Enrico Fermi Institute] 14:05:50
        Yeah.

        [Steven Timm] 14:05:51
        Indeed

        [Ian Fisk] 14:06:04
And I think the thing is — if you look at where NSF

        [Ian Fisk] 14:06:08
has spent their money, they've also spent it on really exploratory things, like Voyager, which is an AI machine.

        [Steven Timm] 14:06:13
        Yeah.

        [Enrico Fermi Institute] 14:06:14
What is it — they have an ARM testbed at Stony Brook right now?

        [Ian Fisk] 14:06:15
Yeah, they have the... some Japanese name.

        [Enrico Fermi Institute] 14:06:20
Ookami, I think.

        [Ian Fisk] 14:06:21
Yeah. And so they've also spent some money on exploratory things.

        [Ian Fisk] 14:06:27
And my guess is that Brian's right, in the sense that NSF is a little bit more in tune with what people are using. But you could imagine that could change as people figure out how to use alternative machines — the GPUs, in addition to having a lot more processing

        [Steven Timm] 14:06:29
        Yeah.

        [Ian Fisk] 14:06:45
power, have a lot more processing power per watt. If that becomes important to people, then there'll be pressure there, too.

        [Enrico Fermi Institute] 14:06:48
        Yeah.

        [Enrico Fermi Institute] 14:06:54
Yeah — I guess the point I was making is NSF

        [Enrico Fermi Institute] 14:06:59
is very attuned to the user base. If five years from now the user base is screaming for GPUs, because machine learning has eaten the world,

        [Ian Fisk] 14:07:09
        right.

        [Enrico Fermi Institute] 14:07:10
then you're going to see a much stronger push. And even if that doesn't happen, I don't get the impression that there's a lot of growth opportunity even at NSF-funded CPU HPC. Yeah, it's a little bit of organic growth.

        [Enrico Fermi Institute] 14:07:27
I mean, Bridges-2 is faster than Bridges, and Expanse is a bit faster than Comet,

        [Steven Timm] 14:07:27
        Great

        [Enrico Fermi Institute] 14:07:32
but it's not an order of magnitude.

        [Enrico Fermi Institute] 14:07:33
They don't double or triple the capacity from one step to the next.

        [Steven Timm] 14:07:36
        Great

        [Steven Timm] 14:07:40
This is a question — I'm not sure if you're going to come to it later in the slides,

        [Steven Timm] 14:07:44
or if it's too early to ask. But you foresee even more CPU that you need, and the existing user-class facilities are not going to grow that much

        [Steven Timm] 14:08:00
during that time; your allocation on them isn't going to grow that much by that time. And the national labs

        [Steven Timm] 14:08:07
are not buying more, because strategically they're saying: we're going to the leadership-class facilities.

        [Steven Timm] 14:08:16
So there's going to be a gap: between 50 and 70% of the resources you need are not going to be there.

        [Steven Timm] 14:08:26
The projections show it — HPC is not going to solve the whole problem.

        [Steven Timm] 14:08:30
There are not enough of them.

        [Enrico Fermi Institute] 14:08:34
Hmm. I mean, if you can use the GPUs — and that gets to the second point. We have the LCFs,

        [Steven Timm] 14:08:43
        Yeah, yeah.

        [Enrico Fermi Institute] 14:08:44
where I'm going a little bit into the LCF

        [Enrico Fermi Institute] 14:08:46
landscape; we discussed a lot of that already in the morning session.

        [Enrico Fermi Institute] 14:08:50
But one thing is the trend towards accelerators:

        [Enrico Fermi Institute] 14:08:56
if you look at what's there in terms of CPU, that's usually still significant.

        [Enrico Fermi Institute] 14:09:01
Most of the flops are on the GPU side, which we can't really use effectively right now. But there's a lot of CPU there, and in my mind an open question is: what's the threshold for being able to use these machines? What's

        [Enrico Fermi Institute] 14:09:19
good enough in terms of GPU utilization?

        [Enrico Fermi Institute] 14:09:24
I don't know the answer to that. I know that very early on, when that move started to happen, there were statements I heard from people who were in meetings with the agencies, saying: oh, you have to have full-on GPU utilization or you're not going to

        [Enrico Fermi Institute] 14:09:42
be allowed on the machine. And that's softened significantly over time.

        [Enrico Fermi Institute] 14:09:46
But still, I mean, there are the two —

        [Enrico Fermi Institute] 14:09:50
there are two sides. One is: what do we need to do to get a proposal through?

        [Taylor Childers] 14:09:56
        sure, sure.

        [Enrico Fermi Institute] 14:09:57
And how much do we need to use the GPU

        [Enrico Fermi Institute] 14:10:00
so we don't feel ashamed of running on these resources ourselves?

        [Enrico Fermi Institute] 14:10:05
There's a certain point where it's just ridiculous, even if they would allow us to run, right?

        [Enrico Fermi Institute] 14:10:10
So we have a question coming from Paolo.

        [Paolo Calafiura (he)] 14:10:12
It's a comment, really.

        [Paolo Calafiura (he)] 14:10:17
I keep hearing the problem framed in this way, not only here but, you know, in ATLAS a lot, even more than here.

        [Paolo Calafiura (he)] 14:10:26
The HPC community is making this move to GPU —

        [Paolo Calafiura (he)] 14:10:32
are they losing all of their users? I don't have precise data, but my understanding, anecdotally, is that today, if you want to run on a GPU node on Perlmutter, you have to wait hours. So we are the laggards, okay; the new

        [Enrico Fermi Institute] 14:10:47
        Yes.

        [Enrico Fermi Institute] 14:10:52
        Yeah.

        [Paolo Calafiura (he)] 14:10:54
communities have no problem whatsoever in using accelerators.

        [Paolo Calafiura (he)] 14:10:59
So we have a choice. Either we become like the banks — we keep running our IBM System/370

        [Paolo Calafiura (he)] 14:11:05
and COBOL, and we are fine, you know, we have the money to do it, and we accept the physics limitations that come with it —

        [Paolo Calafiura (he)] 14:11:16
or we jump. I think, you know, framing the problem like: yeah, maybe NERSC is going to give —

        [Paolo Calafiura (he)] 14:11:23
I mean, NERSC is going to give us what we have now, presumably, for the lifetime of Perlmutter.

        [Paolo Calafiura (he)] 14:11:29
That's about 1% of the simulation.

        [Paolo Calafiura (he)] 14:11:33
I know the ATLAS numbers; I don't know the others.

        [Paolo Calafiura (he)] 14:11:36
I mean, it's nice to have it.

        [Paolo Calafiura (he)] 14:11:40
But is it worth having a workshop about 1%, you know, or maybe 2?

        [Paolo Calafiura (he)] 14:11:45
I think we either make the jump, or

        [Paolo Calafiura (he)] 14:11:53
we just step out, and we say: look, we will use our legacy CPUs, and then perhaps for Run 5, when I'm retired, or worse, we will use whatever architecture is around. So I think we're framing

        [Enrico Fermi Institute] 14:12:06
        But

        [Paolo Calafiura (he)] 14:12:11
the problem in a slightly wrong way. And I know there are other slides discussing accelerators and whatnot.

        [Paolo Calafiura (he)] 14:12:21
        But yeah.

        [Enrico Fermi Institute] 14:12:23
But, Paolo, the jump is not going to be a jump to the top in one

        [Enrico Fermi Institute] 14:12:27
go. We're going to jump up one step, and then we might —

        [Enrico Fermi Institute] 14:12:30
we can jump up the next step, and so on. And how to get to that first step —

        [Enrico Fermi Institute] 14:12:36
that's basically my question.

        [Ian Fisk] 14:12:37
Right. But I think what Dirk would probably say, which I agree with, is that at some point we have to commit that this is a step we're going to make, and that we're going to succeed at it. We can define what success

        [Ian Fisk] 14:12:51
looks like, but we sort of have to say: we're going to do this. And I think you have to say that, because to first order all of the processing is in these machines. The other thing is, I think we're actually not as far off as we think. Like,

        [Enrico Fermi Institute] 14:12:54
        Yeah, I mean.

        [Ian Fisk] 14:13:06
ATLAS — no, not ATLAS; CMS at least,

        [Ian Fisk] 14:13:10
and LHCb, are all using GPUs in the online right now, running software

        [Ian Fisk] 14:13:13
they wrote. We're not that far away. And I think you can define whatever sort of metric you want,

        [Enrico Fermi Institute] 14:13:14
        Okay.

        [Ian Fisk] 14:13:20
but my guess is that a few algorithms that show the thing is faster with the GPUs than without are enough to sort of get you in the door.

        [Enrico Fermi Institute] 14:13:28
Yeah, that was my question,

        [Enrico Fermi Institute] 14:13:30
and I agree with the answer. I just wanted to phrase it as a question, because I know there are disagreements about that. And there are also statements, from the people that fund these machines, from years ago, that were different from that.

        [Ian Fisk] 14:13:40
Alright. And one of the things that we have to be a little bit careful of is that you can be a victim of your own success here. If you take advantage of the accelerated resource,

        [Ian Fisk] 14:13:51
and the processing rate for reconstruction of the tracker in CMS goes up by a factor of 10 —

        [Ian Fisk] 14:13:56
we do not have an I/O system that's designed to handle twice or 10 times the data going in.

        [Enrico Fermi Institute] 14:14:05
        There's a comment from Eric

        [Eric Lancon] 14:14:09
Yes, I wanted to go back to what Paolo said, and, yeah, make sure —

        [Eric Lancon] 14:14:17
I believe there are two topics which are mixed here:

        [Eric Lancon] 14:14:21
accelerators, and HPCs.

        [Eric Lancon] 14:14:27
So, as mentioned by Ian, the code will be ready by most of the experiments, by necessity, for using accelerators.

        [Eric Lancon] 14:14:40
So nothing prevents classical sites from offering accelerators as resources for the experiment.

        [Eric Lancon] 14:14:51
Now, the use of the big HPCs is supposed to —

        [Eric Lancon] 14:15:01
hmm, to address the lack of CPUs as we move rapidly towards the HL-LHC.

        [Enrico Fermi Institute] 14:15:12
        Okay.

        [Eric Lancon] 14:15:16
Is the missing factor as big as we believe? That's what we have to understand.

        [Eric Lancon] 14:15:23
Because the real question is: do we need to use HPC or not, to complement the classical resources beyond standard operation? It's not so clear

        [Eric Lancon] 14:15:34
that we really, really need the big HPCs

        [Eric Lancon] 14:15:43
to complement the effort at the classical sites.

        [Eric Lancon] 14:15:44
Is it true or not? Maybe the missing factor is only 50% on top of the needs.

        [Enrico Fermi Institute] 14:15:56
        Okay.

        [Paolo Calafiura (he)] 14:16:00
I can comment on the needs, having been involved in the calculation. One of the things we have to keep in mind is that the needs sort of naturally tune to the resources available.

        [Paolo Calafiura (he)] 14:16:20
So there is no point in claiming your needs are 100 times bigger than the resources available to you.

        [Paolo Calafiura (he)] 14:16:26
So you make choices which make those needs go down.

        [Paolo Calafiura (he)] 14:16:32
And what I'm very nervous about is that as we try, sort of, to achieve a reasonable computing model, we are potentially giving up things that we could do, especially in a world of precision physics, which is the

        [Paolo Calafiura (he)] 14:16:55
one we are moving towards with Run 3 and Run 4.

        [Paolo Calafiura (he)] 14:16:58
I don't know about Run 5. So I'm a little bit nervous about saying: yeah, we don't really need it — it's still true that we don't really need it —

        [Paolo Calafiura (he)] 14:17:08
but that's because we're making physics choices which allow us not to need it. And whether those choices are wise or not — I'm probably not competent to judge, but there they are.

        [Enrico Fermi Institute] 14:17:28
Ian was next, yeah.

        [Ian Fisk] 14:17:29
Yeah, it was just a comment about the scale, which is to say: when we started planning for the HL-LHC, at ATLAS we had sort of factors of 6 or 10 more need than we could expect to have.

        [Ian Fisk] 14:17:45
And we saw that it was really terrible, and then we've made some improvements.

        [Ian Fisk] 14:17:49
So we fixed it, and now it's down. But the difference between failing completely and making some really painful choices — I think we're now at the level where, if the HPCs got us 25%, that would allow us to make a lot fewer really painful

        [Enrico Fermi Institute] 14:17:57
        You.

        [Ian Fisk] 14:18:04
choices. I understand 25% is not a factor of 4 or 5, like

        [Enrico Fermi Institute] 14:18:06
        Okay.

        [Ian Fisk] 14:18:10
it was a few years back, but it seems like — there was a time,

        [Ian Fisk] 14:18:14
certainly, when if someone told you that you had 20% more computing resources, you would have been thrilled.

        [Ian Fisk] 14:18:24
And it just seems like these resources are on the table.

        [Ian Fisk] 14:18:28
They are there; we built them. It seems like it would be —

        [Ian Fisk] 14:18:34
it would be a really strange choice not to at least try to use them.

        [Eric Lancon] 14:18:40
No, no, I agree. But the first thing is to get the software ready.

        [Enrico Fermi Institute] 14:18:50
        Yeah, maybe that's a good way to lead over to the next, which is looking at how we're actually using these facilities like some of the integrations next slide

        [Enrico Fermi Institute] 14:19:02
So where are we actually running today, actively? So, ATLAS, do you want to say something? — For ATLAS, we've been using Cori and Perlmutter for multiple years.

        [Enrico Fermi Institute] 14:19:15
We had a proposal for using TACC Frontera.

        [Enrico Fermi Institute] 14:19:21
And in the past we used OLCF machines.

        [Enrico Fermi Institute] 14:19:25
Yeah, but those are sort of dormant now. Most of the focus is on NERSC, Cori and Perlmutter,

        [Enrico Fermi Institute] 14:19:32
and TACC. Yeah. For CMS: similarly, we focused on the user facilities because of the low-hanging fruit; it was easier.

        [Enrico Fermi Institute] 14:19:42
Cori and Perlmutter for multiple years. We have an XSEDE allocation — now, I guess, ACCESS; that transition hasn't fully happened yet.

        [Enrico Fermi Institute] 14:19:50
So for the next one we'll have to deal with ACCESS. We had been running on whatever was available;

        [Enrico Fermi Institute] 14:19:58
currently that set is Expanse, Anvil, and Stampede2. In the past

        [Enrico Fermi Institute] 14:20:04
it was Bridges and Comet. And on Frontera we've been running for multiple years. Then at the LCFs we had one allocation in the past and one currently active: in the past

        [Enrico Fermi Institute] 14:20:16
we had the Theta allocation that was joint with ATLAS;

        [Enrico Fermi Institute] 14:20:20
we used it to do some generator work. And now we're actually trying something a little bit more serious, which is on Summit: to contribute Summit resources

        [Enrico Fermi Institute] 14:20:35
to the end-of-year 2022 CMS

        [Enrico Fermi Institute] 14:20:40
data re-reconstruction. And the physics validation of Power was just completed — not on Summit, but with Marconi100, which is basically exactly the same system

        [Enrico Fermi Institute] 14:20:51
architecture as Summit. But that was CPU-only validation.

        [Enrico Fermi Institute] 14:20:56
So hopefully GPU will be validated as the next step. Basically, that's what we want to do with Summit.

        [Enrico Fermi Institute] 14:21:03
Yeah. We also have some slides on the European efforts as well.

        [Enrico Fermi Institute] 14:21:09
Just wanted to show it as an example, because they sometimes follow different approaches in terms of integration.

        [Enrico Fermi Institute] 14:21:16
So you're using GPUs in the end-of-2022 data re-reconstruction? — That's the plan; that's what we want to do.

        [Enrico Fermi Institute] 14:21:22
We have 50,000 hours on Perlmutter from the allocation that we got, and we have 50,000 hours on Summit, which is not much —

        [Enrico Fermi Institute] 14:21:31
it's not going to contribute a lot, but we just want to show proof of principle.

        [Enrico Fermi Institute] 14:21:36
And then, if it works, we would ask for more hours in the next allocation cycle to do this again,

        [Andrew Melo] 14:21:41
Sure, sorry — what was the second half of Rob's question? I heard "do you want to use GPUs" and then I kind of lost it.

        [Enrico Fermi Institute] 14:21:41
but with a larger allocation.

        [Enrico Fermi Institute] 14:21:51
I was asking whether, in the plans for the end-of-2022 data re-reconstruction, you're going to use GPUs.

        [Enrico Fermi Institute] 14:22:03
Yes. I mean, the problem at the moment is more putting together a workflow — trying to figure out which GPU algorithms are ready to put in. It might just be that we're going to run something in parallel to the normal reconstruction, and then use that as a

        [Enrico Fermi Institute] 14:22:23
validation; maybe run some validation samples. I would be happy with that as well.

        [Enrico Fermi Institute] 14:22:27
It's not directly in the main reconstruction, but more like a workflow they can run again and compare.

        [Andrew Melo] 14:22:35
On that: we actually do have an offline re-reconstruction workflow that's very close to being validated.

        [Enrico Fermi Institute] 14:22:39
        Okay.

        [Enrico Fermi Institute] 14:22:44
        And I know I know, I know.

        [Andrew Melo] 14:22:45
Yeah, yeah, but it's just a matter of — there are some issues with the CPU

        [Andrew Melo] 14:22:52
side of the memory taking more than it needs. But I think by the end of the year, for sure, we're going to at least be doing some fraction of the reconstruction with GPUs.

        [Enrico Fermi Institute] 14:23:01
Yeah, I hope that will happen, and then we can —

        [Enrico Fermi Institute] 14:23:07
Great. Yeah, as far as integration goes — specific technologies: for ATLAS we're using Harvester, which runs at the edge.

        [Enrico Fermi Institute] 14:23:19
So at all of our HPC facilities we run a Harvester process that essentially lives on the HPC

        [Enrico Fermi Institute] 14:23:24
login nodes. Harvester directly pulls jobs down from PanDA, transforms them, and packs them appropriately so that they can be sent to the local HPC.

        [Enrico Fermi Institute] 14:23:36
It also handles the data transfer, so it facilitates staging

        [Enrico Fermi Institute] 14:23:40
the data in and out of the Rucio data federation, essentially by way of a third-party service that lives at BNL.

        [Enrico Fermi Institute] 14:23:50
Yeah. And so this approach works on all the sites, including the LCFs, because pilots don't necessarily have to talk to the wide-area network. Everything is local, and Harvester facilitates all the communication with PanDA through the shared file system.
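As a cartoon of this edge-service pattern — not the actual Harvester code; the endpoint, job format, and directory layout are invented for illustration — the core pull-transform-drop loop on a login node looks roughly like:

```python
# Sketch of an edge service on an HPC login node: pull work from a central
# queue over HTTPS, repackage it, and hand it to the batch system via the
# shared file system, so worker nodes never need outbound connectivity.
# The URL, JSON fields, and directory layout are illustrative assumptions.
import json
import pathlib
import subprocess
import time
import urllib.request

QUEUE_URL = "https://workload-manager.example.org/getjobs?site=my_hpc"  # hypothetical
SHARED_DIR = pathlib.Path("/lustre/shared/edge-service")                # hypothetical

def pull_jobs():
    """Fetch a batch of job descriptions from the central queue."""
    with urllib.request.urlopen(QUEUE_URL) as resp:
        return json.load(resp)        # assume a list of dicts with an "id" key

def stage_and_submit(job):
    """Drop the payload where a network-less worker can read it, then submit."""
    workdir = SHARED_DIR / str(job["id"])
    workdir.mkdir(parents=True, exist_ok=True)
    (workdir / "payload.json").write_text(json.dumps(job))
    # run_payload.sh (placeholder) reads payload.json from the shared file system.
    subprocess.run(["sbatch", "--chdir", str(workdir), "run_payload.sh"],
                   check=True)

if __name__ == "__main__":
    while True:
        for job in pull_jobs():
            stage_and_submit(job)
        time.sleep(300)               # poll the central queue every 5 minutes
```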

        [Enrico Fermi Institute] 14:24:12
CMS then does things a little bit differently, which has advantages and disadvantages.

        [Enrico Fermi Institute] 14:24:18
The advantages are mostly in the HPC integration at the user facilities, because it really makes the HPC look like a grid site.

        [Enrico Fermi Institute] 14:24:30
It's basically the same approach we use for opportunistic resources, like when we tried to run on the LIGO site.

        [Enrico Fermi Institute] 14:24:36
Basically: the software is available via CVMFS, or via cvmfsexec

        [Enrico Fermi Institute] 14:24:42
that we run ourselves; we use container solutions for OS

        [Enrico Fermi Institute] 14:24:46
independence, a local Squid, and no managed storage at these facilities. So we treat it as an extension — basically an add-on to Fermilab storage.

        [Enrico Fermi Institute] 14:24:58
It uses Fermilab storage, or AAA — the whole CMS data federation — but mostly Fermilab, for reading input data, streaming input data.

        [Enrico Fermi Institute] 14:25:06
And then it stages out directly to Fermilab, so we don't have to worry about local site storage or data transfers.

        [Enrico Fermi Institute] 14:25:11
Nothing site-managed; everything is contained within the job.
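A minimal sketch of that self-contained runtime recipe, assuming cvmfsexec and Apptainer are available on the node — the payload command, repository list, and image path are illustrative, not the production configuration:

```python
# Sketch: mount CVMFS without privileges via cvmfsexec, then run the payload
# inside an unprivileged container for OS independence. Everything is
# contained within the job; nothing is installed by the site.
import subprocess

payload = ["cmsRun", "step1_cfg.py"]   # placeholder payload command

# Placeholder container image distributed through CVMFS.
image = "/cvmfs/unpacked.cern.ch/registry.hub.docker.com/cmssw/el8:x86_64"

# cvmfsexec mounts the listed repositories in user space and runs the command
# after "--"; apptainer supplies the OS environment for the payload.
cmd = ["./cvmfsexec", "cms.cern.ch", "unpacked.cern.ch", "--",
       "apptainer", "exec", "--bind", "/scratch", image] + payload
subprocess.run(cmd, check=True)
```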

        [Enrico Fermi Institute] 14:25:21
And the provisioning integration follows the OSG model: we submit pilots through HTCondor BOSCO,

        [Enrico Fermi Institute] 14:25:23
remote SSH. That's either directly connected to HEPCloud, in the case of NERSC, or, for XSEDE and TACC resources,

        [Enrico Fermi Institute] 14:25:31
we go through OSG-managed HTCondor instances. And we might eventually also do the same for those.
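On the provisioning side, the BOSCO-style remote submission can be sketched with the HTCondor Python bindings. The login host, account, and pilot script below are placeholders, and the exact grid-resource string should be checked against the site's batch system:

```python
# Sketch: submit a pilot to a remote HPC batch system over SSH using
# HTCondor's grid universe ("BOSCO" style). No service runs permanently
# at the HPC; HTCondor ssh-es to the login node and submits to Slurm there.
import htcondor

pilot = htcondor.Submit({
    "universe": "grid",
    # "batch slurm <user>@<host>" is the BOSCO-style target; placeholder host.
    "grid_resource": "batch slurm cmsuser@login.hpc.example.edu",
    "executable": "pilot_wrapper.sh",   # pilot fetches payloads once running
    "output": "pilot.$(Cluster).out",
    "error": "pilot.$(Cluster).err",
    "log": "pilot.log",
})

schedd = htcondor.Schedd()
result = schedd.submit(pilot)           # recent (>= 9.x) bindings API
print("submitted cluster", result.cluster())
```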

        [Enrico Fermi Institute] 14:25:40
And do you know — have you measured — staging versus streaming, to see the difference? — We know it for NERSC. At NERSC the storage is now fully integrated, but at the beginning

        [Enrico Fermi Institute] 14:25:56
it wasn't fully integrated, and we just copied in, more or less manually, the most often used pileup library.

        [Enrico Fermi Institute] 14:26:04
They gave us some space for that. And I actually have a comparison:

        [Enrico Fermi Institute] 14:26:07
it makes very little difference for job failure rates; CPU efficiency is about 5 to 10% different.

        [Enrico Fermi Institute] 14:26:14
Okay, so it's a small efficiency optimization.

        [Enrico Fermi Institute] 14:26:19
It's a noticeable effect, but it's not a huge effect. Exactly —

        [Enrico Fermi Institute] 14:26:22
you don't see a 50% difference, for example.

        [Enrico Fermi Institute] 14:26:28
The downside of this — I mean, the upside is that it's simple.

        [Enrico Fermi Institute] 14:26:34
We don't have anything running permanently at the HPC side;

        [Enrico Fermi Institute] 14:26:38
it basically completely follows the grid integration model.

        [Enrico Fermi Institute] 14:26:43
The downside is that the LCFs are really not compatible with this approach, because you don't have outbound Internet; you can't follow this approach completely. The runtime kind of works the same way, because cvmfsexec and Singularity

        [Enrico Fermi Institute] 14:26:58
are both there, so that part works. And as long as you can somehow run a Squid server on the edge,

        [Enrico Fermi Institute] 14:27:03
you can do things. The provisioning layer is the larger issue.

        [Enrico Fermi Institute] 14:27:11
Yeah, and we only have prototypes so far, nothing

        [Enrico Fermi Institute] 14:27:13
we would call production. Okay. And AAA —

        [Enrico Fermi Institute] 14:27:17
so far that's also not usable, so we can't stream to LCF

        [Enrico Fermi Institute] 14:27:21
batch nodes. The two possible solutions here: an XRootD proxy in principle is possible — we've only ever talked about it;

        [Enrico Fermi Institute] 14:27:30
I don't think anyone has ever set one up at an LCF.

        [Enrico Fermi Institute] 14:27:33
And it's probably too much network traffic to route through a single edge

        [Enrico Fermi Institute] 14:27:39
node, no matter how well-provisioned that is — at least not at

        [Enrico Fermi Institute] 14:27:43
        The scales we're talking about here to make click.

        [Enrico Fermi Institute] 14:27:47
The other option is that you actively manage the storage:

        [Enrico Fermi Institute] 14:27:50
you do your Rucio integration at a lower level, and then you just have the normal CMS

        [Enrico Fermi Institute] 14:27:57
data management and workflow management stacks work with that location and pre-stage the data. And again, at the LCF-type scale,

        [Enrico Fermi Institute] 14:28:04
I think you need to actively manage it.
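A rough sketch of the pre-staging step for that second option, driving plain xrdcp from Python. The file names and target area are placeholders; in production the data management layer, not a hand-written list, would decide what to move:

```python
# Sketch: pre-stage input files from the XRootD federation onto the HPC
# file system before jobs start, since LCF batch nodes cannot stream from
# outside. The LFNs and staging area are illustrative placeholders.
import pathlib
import subprocess

REDIRECTOR = "root://cmsxrootd.fnal.gov/"                   # example entry point
STAGE_AREA = pathlib.Path("/lustre/project/cms/prestage")   # hypothetical

files_to_stage = [
    "/store/mc/example/dataset/file1.root",                 # placeholder LFNs
    "/store/mc/example/dataset/file2.root",
]

STAGE_AREA.mkdir(parents=True, exist_ok=True)
for lfn in files_to_stage:
    dest = STAGE_AREA / pathlib.Path(lfn).name
    # Copy one file from the federation to local disk; a few retries guard
    # against transient WAN errors.
    subprocess.run(["xrdcp", "--retry", "3", REDIRECTOR + lfn, str(dest)],
                   check=True)
```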

        [abh] 14:28:06
Right — could I pipe in here just for a second?

        [Enrico Fermi Institute] 14:28:09
        Yeah.

        [abh] 14:28:11
People have used proxies at NERSC. Mind you, the setup there is a little bit easier, because they have multiple DTNs, and you can actually use all of them — all of the DTNs — for the proxy server.

        [abh] 14:28:23
So it is possible, but you need a rather fluid setup like NERSC's.

        [Enrico Fermi Institute] 14:28:23
        Huh!

        [Enrico Fermi Institute] 14:28:32
Yeah. As I said, at NERSC it wasn't —

        [Enrico Fermi Institute] 14:28:34
I mean, I think the wide-area connectivity is good enough that we don't really need it at the moment.

        [Enrico Fermi Institute] 14:28:41
It's not worth the effort yet.

        [abh] 14:28:42
        Okay.

        [Enrico Fermi Institute] 14:28:45
And Perlmutter should be even better. Maybe — we haven't really scale-tested Perlmutter at that level yet.

        [Enrico Fermi Institute] 14:28:52
But from what I saw of how the design has evolved, and what that gives us in terms of network integration,

        [Enrico Fermi Institute] 14:28:58
and from what he said as well, I expect it to work better going forward.

        [Enrico Fermi Institute] 14:29:04
So the CMS plan is to just not even worry about local storage. And Fermilab doesn't have a Globus Online license,

        [Enrico Fermi Institute] 14:29:21
so our plan is that we do multi-hop transfers through NERSC, because NERSC at the moment still has GridFTP, and we're working with them to get XRootD transfers going. Once that is in place, our plan is to manage the LCF

        [Enrico Fermi Institute] 14:29:35
data transfers through NERSC. Everything goes multi-hop through NERSC, so we will need a bit of space there.

        [Enrico Fermi Institute] 14:29:42
And once that is in place, we might start thinking about and exploring also running actively managed storage there.

        [Enrico Fermi Institute] 14:29:49
But we will probably still have a large streaming component. — A dumb question, and we can stop going down the rabbit hole:

        [Enrico Fermi Institute] 14:29:54
I assume, like, seven of the Tier-2s have Globus licenses;

        [Enrico Fermi Institute] 14:30:01
we could route it through those, too.

        [Enrico Fermi Institute] 14:30:05
        For different.
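The multi-hop idea can be sketched as two chained copies — grid site to NERSC over a WAN-facing protocol, then NERSC onward to the LCF — with every endpoint below being a placeholder rather than real configuration:

```python
# Sketch of a multi-hop transfer: grid site -> NERSC -> LCF, for the case
# where the LCF has no WAN-facing door that grid tools can talk to directly.
# All hosts and paths are illustrative placeholders.
import subprocess

def copy(src, dst):
    """One hop, with a few retries for transient failures."""
    subprocess.run(["xrdcp", "--retry", "3", src, dst], check=True)

# Hop 1: pull from the grid into NERSC's WAN-visible storage.
copy("root://grid-se.example.org//store/data/file.root",
     "root://dtn.nersc.example.gov//global/cfs/cms/hop/file.root")

# Hop 2: move from the NERSC buffer space onward to the LCF, initiated
# from a point that can reach both ends.
copy("root://dtn.nersc.example.gov//global/cfs/cms/hop/file.root",
     "root://dtn.lcf.example.gov//proj/cms/input/file.root")
```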

        [Paolo Calafiura (he)] 14:30:11
And just to be sure I understand: by provisioning integration, you mean assigning the work to workers, since they cannot reach out?

        [Enrico Fermi Institute] 14:30:19
It's basically — you have work in the system

        [Enrico Fermi Institute] 14:30:26
that is assigned to an HPC; now bring up resources to run that work, and route

        [Paolo Calafiura (he)] 14:30:32
        Yeah, yeah, yeah, understood. Yeah.

        [Enrico Fermi Institute] 14:30:33
the work there.

        [Enrico Fermi Institute] 14:30:41
So now we have a slide on strategic considerations and the security model.

        [Enrico Fermi Institute] 14:30:48
We probably don't need to spend too much time on the security model, because there's a discussion on Wednesday where we'll hopefully have some security folks from Fermilab —

        [Enrico Fermi Institute] 14:30:59
we invited someone — and maybe from WLCG as well. But we wanted to discuss some of the strategic things about HPC use, and we already covered some of it:

        [Enrico Fermi Institute] 14:31:12
the yearly allocation cycle doesn't fit with our resource planning, and so we'd be planning with resources that we're not sure we will have.

        [Enrico Fermi Institute] 14:31:20
But so far we've focused mostly on the fact that, since they don't fit our resource planning cycle and we can't pledge them,

        [Enrico Fermi Institute] 14:31:27
we don't get any credit for them — which is eventually mostly a problem for the funding agencies.

        [Enrico Fermi Institute] 14:31:31
But there's another issue. If we're moving into a resource-constrained environment for the HL-LHC, it also means that resources that are not pledged, and that we cannot plan with, cannot be included as part of our plan — which means our plan has to be artificially downsized to not consider them,

        [Enrico Fermi Institute] 14:31:49
which might be a restriction on us. At the moment

        [Enrico Fermi Institute] 14:31:52
it isn't, so much, because we have enough resources to cover everything we need to do.

        [Enrico Fermi Institute] 14:31:58
But that might not be the case anymore in the HL-LHC environment.

        [Enrico Fermi Institute] 14:32:09
I see Eric's hand.

        [Eric Lancon] 14:32:12
Yes, I'd like to intervene, because it's not the first time we hear that we cannot pledge.

        [Eric Lancon] 14:32:20
        I think it's a bit too strong a statement.

        [Eric Lancon] 14:32:26
It might be better to say that the experiments, or the WLCG,

        [Eric Lancon] 14:32:34
need to evolve towards a model adapted to campaigns.

        [Eric Lancon] 14:32:42
Because currently we would like to use those HPCs

        [Eric Lancon] 14:32:49
as regular WLCG sites, and they're not so very well suited for this.

        [Eric Lancon] 14:32:57
You may want to consider that the experiments run campaigns a few times in the year, and these campaigns of short duration are exported to those HPCs,

        [Eric Lancon] 14:33:12
which have a large capacity. In that case you could consider pledging these resources, because you don't have a flat requirement of CPU across the year from the experiment. You see what I mean?

        [Enrico Fermi Institute] 14:33:30
So you want to pledge it for specific purposes —

        [Enrico Fermi Institute] 14:33:35
you want to say that this campaign is a pledged campaign on this resource. That would move away —

        [Enrico Fermi Institute] 14:33:42
I think we had that this morning, where we said we want to

        [Enrico Fermi Institute] 14:33:46
move away from the universally usable resource pledge —

        [Enrico Fermi Institute] 14:33:51
that is, one you could target anything at — to pledging for a specific purpose.

        [Eric Lancon] 14:33:58
Yes. Because why is the Monte Carlo quite flat across the year, to first order?

        [Eric Lancon] 14:34:05
It's because, yeah, there's not enough

        [Eric Lancon] 14:34:08
CPU capacity to absorb the Monte Carlo simulation within one month.

        [Eric Lancon] 14:34:16
One month is just an example. So the operational model should adapt to the type of resources that the experiments want to use.

        [Eric Lancon] 14:34:28
        Maybe

        [Enrico Fermi Institute] 14:34:32
Okay, I see hands. Andrew?

        [Andrew Melo] 14:34:38
Yeah. So I did want to point out, first off, that there is a meeting —

        [Andrew Melo] 14:34:43
a WLCG meeting is planned for November,

        [Andrew Melo] 14:34:47
where the plan is, I guess — at least according to someone — to reopen the MoU

        [Andrew Melo] 14:34:54
and to discuss things like this. So I don't think that that's going to be stuck there forever.

        [Andrew Melo] 14:35:01
And then I think also, you know, the new HEPScore benchmark is quickly converging, so that we actually do have a unit that we can use to make a resource request

        [Andrew Melo] 14:35:18
and also pledges. I do want to push back a little bit and say that we probably don't want the pledging infrastructure to be so fine-grained as to say that we request X amount of whatevers for a certain amount of time

        [Andrew Melo] 14:35:36
on the resources. But I do think that the ability to

        [Andrew Melo] 14:35:45
put these facilities into the pledge in a holistic way is something that's hopefully coming with the cycle of everything.

        [Andrew Melo] 14:35:51
How it works — definitely not 2024, but maybe on the 2025-26 timescale.

        [Andrew Melo] 14:36:03
I think that —

        [Andrew Melo] 14:36:04
I think that, you know, with the benchmarks coming around, we can actually

        [Andrew Melo] 14:36:10
say and quantify what these machines are; and with the, I guess, political idea that we're going to get from the conversation on the MoU — whatever it is — I think this is something that we can hopefully get done in the short

        [Andrew Melo] 14:36:25
        term

        [Enrico Fermi Institute] 14:36:27
        Okay.

        [Enrico Fermi Institute] 14:36:27
        okay.

        [Enrico Fermi Institute] 14:36:30
Okay, Simone?

        [simonecampana] 14:36:35
Yes, I think there is a bit of confusion, first of all, on the latest topic.

        [simonecampana] 14:36:41
If you read the MoU, there is nothing written there that says that an HPC

        [simonecampana] 14:36:46
cannot be used as a pledged resource — as simple as that. So one doesn't have to

        [simonecampana] 14:36:50
rediscuss the MoU to discuss this. There are HPCs that have been part of the pledges for at least a decade and a half: in the Nordic countries, you know, the Tier-1 provides resources also partially through time on an HPC. So the reality is that the MoU tells

        [simonecampana] 14:37:10
you the basic principles of what can be considered a pledged resource: it has to be something with a certain amount of availability,

        [simonecampana] 14:37:18
the availability needs to be accounted for, you need to be able to send a ticket to it — and that's what it says.

        [simonecampana] 14:37:22
So I think that, in terms of policy, we don't need a major discussion and a rewrite of the MoU.

        [simonecampana] 14:37:35
The work can start today. I think there is something technical to be done, because a lot of what I just mentioned

        [simonecampana] 14:37:40
may be a technical detail, but someone still has to do the work of integrating the facility properly.

        [Enrico Fermi Institute] 14:37:49
        But but

        [simonecampana] 14:37:50
The other thing — and this is the comment I made this morning — is that when you try to define a facility that works for one use case, the question is at which granularity you want to go. If it

        [simonecampana] 14:38:06
is Monte Carlo versus data processing: fine.

        [simonecampana] 14:38:09
If it is a specific kind of Monte Carlo: a bit less fine. If it is only event generation — because it's the only one that doesn't need an input — it starts becoming really fine-grained. And for those of you who participated in the discussions at the RRB, and you know

        [simonecampana] 14:38:25
all the process that has to do with resource requests, etc. —

        [simonecampana] 14:38:31
this becomes very complicated very quickly. So at the end the risk is that we do a lot of work to pledge HPCs for a benefit that is not particularly measurable.

        [Enrico Fermi Institute] 14:38:38
        Yeah.

        [simonecampana] 14:38:46
I think we are confusing accounting for the work that those HPCs are doing with the idea that those HPCs are multi-purpose facilities — which today many of them are not. If you try to discuss with ALCF, for

        [simonecampana] 14:39:03
example, today there is not a lot you can do there unless you can use all those GPUs.

        [simonecampana] 14:39:09
So, is that a multi-purpose facility? Today it is not. So

        [simonecampana] 14:39:11
I think there is a bit of confusion around what is policy,

        [Enrico Fermi Institute] 14:39:14
        Okay.

        [simonecampana] 14:39:16
what is practical, and what needs technical work to be done.

        [simonecampana] 14:39:20
So I think this needs to be organized a bit.

        [Enrico Fermi Institute] 14:39:25
But even at the policy level, the one example you gave is something that... maybe I should use the word non-WLCG resource, or something like this.

        [Enrico Fermi Institute] 14:39:35
But the idea of reliability on something where you're not going to use it

        [Enrico Fermi Institute] 14:39:39
for 9 months of the year, and then you're gonna get a burst of, you know, 200,000 cores...

        [Enrico Fermi Institute] 14:39:48
Policy-wise, I'm not sure that has any translation.

        [Enrico Fermi Institute] 14:39:51
I mean, for the sorts of resources we're talking about here,

        [Enrico Fermi Institute] 14:39:55
it doesn't fit within the policy framework. That's my concern.

        [Enrico Fermi Institute] 14:40:01
If the policy is it needs to be up 90% of the time, and you need access to a certain base load

        [Enrico Fermi Institute] 14:40:09
of cores, burst once a year... that's not how these things work. So that's why I was saying that we really do need the policy work here as well.

        [simonecampana] 14:40:19
A little bit, but the reality is that a lot of what we care about is that not 90% of your jobs fail when you end up there, and this being an HPC

        [simonecampana] 14:40:29
or a grid site. I'm sorry, but it's a useful thing to ask, right?

        [Enrico Fermi Institute] 14:40:36
Yeah, you know, in much the same way that you have, in the power ecosystem, base load and variable demand modes,

        [Enrico Fermi Institute] 14:40:47
I think we need to have some more fundamental ideas in the policy framework.

        [Enrico Fermi Institute] 14:40:54
You know, right now our power grid is built from coal, and only coal, and we say that wind can't possibly be accounted for, and yet both have of course been successful.

        [simonecampana] 14:40:59
        yeah.

        [simonecampana] 14:41:04
        I just

        [simonecampana] 14:41:07
I understand, Brian, but you realize that the discussion on availability is not the one that today is stopping an HPC from being a pledged resource,

        [simonecampana] 14:41:14
        Right

        [Enrico Fermi Institute] 14:41:16
Let's take a couple more quick comments, and then we can have more discussion about pledging on Wednesday; we have a dedicated discussion. Andrew, do you have a quick comment?

        [Andrew Melo] 14:41:26
Sorry, my hand is still up, but I'll just quickly point out

        [Andrew Melo] 14:41:32
that we can't do this today. It's not that the pledging statutes say you can't use HPCs in pledging;

        [Andrew Melo] 14:41:41
it's just the rules that are set around pledging, how you pledge

        [Andrew Melo] 14:41:45
resources. Basically, you can't do that. It's not that there's an explicit prohibition on it,

        [Andrew Melo] 14:41:52
but you just simply can't do it.

        [Enrico Fermi Institute] 14:41:54
        yeah.

        [Enrico Fermi Institute] 14:41:55
        Yeah.

        [simonecampana] 14:41:56
I just don't understand this, but fine, I'll let it go.

        [simonecampana] 14:41:59
I mean, there are other places where they pledge

        [simonecampana] 14:42:02
HPCs, so somehow that works, right?

        [Enrico Fermi Institute] 14:42:02
Yeah, yeah, but they basically put a grid site on top of it.

        [simonecampana] 14:42:07
        Well, then, yeah, you have to do some work. Yes, I agree.

        [Enrico Fermi Institute] 14:42:07
So with all the rules. No, but the problem is here:

        [simonecampana] 14:42:10
        Yeah.

        [Enrico Fermi Institute] 14:42:12
it means that you would have to influence the scheduling of the HPC

        [Enrico Fermi Institute] 14:42:18
facility. So the HPC facility itself would have to internally adjust their scheduling policy to match the grid model, at least for a fraction of the site, and that's just not how things are done in the US. We are a customer.

        [Enrico Fermi Institute] 14:42:33
        We don't tell them how they do their scheduling.

        [Andrew Melo] 14:42:35
Okay, or let me give another example. Let's say that, you know, today, and I don't know the inside of it,

        [Enrico Fermi Institute] 14:42:35
        We use the resources as they give them to us

        [Andrew Melo] 14:42:41
but, you know, let's say that we're now using Amazon for CMS jobs.

        [Andrew Melo] 14:42:46
We can't send site availability, you know, we can't send SAM tests to Amazon right now. So whatever resources Amazon is going to give don't show up in the monitoring. Now, it shouldn't be that way, but that's how it

        [Andrew Melo] 14:43:02
        is.

        [Enrico Fermi Institute] 14:43:04
        let's

        [Enrico Fermi Institute] 14:43:05
Let's take a comment from Ian, and then let's move on.

        [Ian Fisk] 14:43:07
My comment was: as I understood, this was a blueprint meeting, and a blueprint is typically the design for something that you're going to build in the future, which means that I think we need to be a little bit careful when we talk about

        [Steven Timm] 14:43:07
        good.

        [Ian Fisk] 14:43:19
sort of the reality of right now and the limitations that we face right now, and try to be able to see a little bit farther ahead,

        [Ian Fisk] 14:43:26
to when some of those limitations will not be there. And so if we want to talk about pledging, maybe we need to sort of define it

        [Ian Fisk] 14:43:32
in such a way that it's maybe the ability to run all workflows, or the ability to run some subset of workflows.

        [Ian Fisk] 14:43:41
But I think we do ourselves a disservice

        [Ian Fisk] 14:43:43
if we expect that nothing's going to change, because I think we will, as a field, along with the rest of science, figure out how to use these machines, and we will figure out how to use clouds.

        [Ian Fisk] 14:43:57
And we need to sort of plan for our own success,

        [Ian Fisk] 14:43:59
        I think

        [Enrico Fermi Institute] 14:44:05
        So that's a great point

        [Enrico Fermi Institute] 14:44:08
Yeah, we already talked quite a bit about the second point.

        [Enrico Fermi Institute] 14:44:13
I just wanted to go into it a little bit, because there's one thing that hasn't been brought up yet:

        [Enrico Fermi Institute] 14:44:20
basically, how we deal with larger architecture changes.

        [Enrico Fermi Institute] 14:44:24
We went into that quite a bit already; we have already seen this.

        [Enrico Fermi Institute] 14:44:29
Today we see multiple GPU architectures. Basically, the early porting efforts to GPU focused on NVIDIA, because that's what everyone is using, to a large extent.

        [Enrico Fermi Institute] 14:44:40
That's still what everyone is using. But if you look at what the LCFs

        [Enrico Fermi Institute] 14:44:43
are deploying: Frontier has AMD, whatever comes next may be different, Aurora will have Intel.

        [Enrico Fermi Institute] 14:44:52
So what are we doing there? And then the next generation might have some weird FPGA AI accelerator.

        [Enrico Fermi Institute] 14:44:58
Who knows? I know that the framework groups, and this is outside the scope here, are looking at performance portability solutions.

        [Enrico Fermi Institute] 14:45:06
So far it looks like yes, you can run everywhere, but you take a severe performance hit.

        [Enrico Fermi Institute] 14:45:11
Is that enough? That's an open topic here. But that's the only alternative: if that's not enough,

        [Enrico Fermi Institute] 14:45:20
and if this doesn't work, then you kind of have to limit what you can target, because I'm not sure...

        [Taylor Childers] 14:45:26
Sure, can I push back on that? You know, the PPS group in HEP-CCE has shown that you can use these frameworks, and sure, you're going to take a performance hit.

        [Taylor Childers] 14:45:38
But I would argue 10% is not something that is worth the effort.

        [Enrico Fermi Institute] 14:45:41
        Okay.

        [Enrico Fermi Institute] 14:45:45
It was with a question mark, because maybe it is enough.

        [Taylor Childers] 14:45:45
Especially in the Madgraph case, right?

        [Taylor Childers] 14:45:50
I mean, we're running Madgraph with base CUDA, SYCL, Kokkos, Alpaka, and sure, CUDA outperforms.

        [Taylor Childers] 14:46:02
But the amount of work that has gone into the CUDA to get another 10%... it's just not worth it.

        [Enrico Fermi Institute] 14:46:11
Because I think there are two options here, given what we have to do.

        [Enrico Fermi Institute] 14:46:17
And I know this is outside the scope of the workshop, but it impacts what we can plan with. Basically, the only two options are either performance portability, or we just don't target a certain architecture. Because we cannot, every 5 years, if the LCF decides they want the newest greatest and best

        [Enrico Fermi Institute] 14:46:36
accelerator chip, just refactor our whole software stack. It's just not feasible.

        [Enrico Fermi Institute] 14:46:44
        So

        [Enrico Fermi Institute] 14:46:48
Okay. And then in terms of strategic considerations: just because we managed to be able to use this generation's LCFs

        [Enrico Fermi Institute] 14:46:59
doesn't really guarantee that we can use the next. So we need to keep that in mind when we do the long-term planning, because there might come a point where basically the amount of HPC deployment usable for us goes down, and we need to shift that

        [Enrico Fermi Institute] 14:47:15
capacity somewhere.

        [Enrico Fermi Institute] 14:47:21
And then, as a question: does anyone else have any other comment or concern,

        [Enrico Fermi Institute] 14:47:27
strategically, about going all in, making the jump, as Paolo said,

        [Enrico Fermi Institute] 14:47:32
on the HPC side, where we could miss the jump?

        [Enrico Fermi Institute] 14:47:39
In terms of making the jump, I mean, we can sort of hedge our bets a little bit with that, right? I mean, we don't have to make the jump with 100%

        [Enrico Fermi Institute] 14:47:52
of our computing. So, I mean, as I mentioned, you don't jump in one go

        [Enrico Fermi Institute] 14:48:01
to the top. You make a small jump, you see where you are, and you make another jump.

        [Enrico Fermi Institute] 14:48:07
        It's a gradual process

        [Paolo Calafiura (he)] 14:48:09
One thing I want to say, which I've heard from a reliable source in some community with multiple jumps, is that the first jump is the worst one.

        [Enrico Fermi Institute] 14:48:10
        Yeah.

        [Paolo Calafiura (he)] 14:48:22
The second, the third, and the fourth are increasingly easier; the more you go from one architecture to the other, the less you have to redo each time.

        [Enrico Fermi Institute] 14:48:40
        Yeah, I didn't even mention it here, because I don't think it's a big problem.

        [Enrico Fermi Institute] 14:48:44
the multiple CPU architectures. At least I don't see a big issue on the CMS side.

        [Enrico Fermi Institute] 14:48:50
That's usually just a recompile and a revalidation.

        [Enrico Fermi Institute] 14:48:55
The jump to GPU, though... I'm just not...

        [Paolo Calafiura (he)] 14:48:58
No, what I'm saying is that once you jump to GPU, or to, let's say, a parallelization layer, whatever it is, that is a very painful jump.

        [Paolo Calafiura (he)] 14:49:09
But once you have done that jump, going from one GPU to another, or from one GPU to some so-far-unknown architecture, which, you know, if what we do is mostly matrix multiplications, with JAX for example, going to JAX may be less painful than the first

        [Enrico Fermi Institute] 14:49:11
        Just


        [Paolo Calafiura (he)] 14:49:27
one. That's what I'm saying, that's what I was trying to say.

        [Enrico Fermi Institute] 14:49:35
Okay, we move on. I think we have some presentations next. Do we want to say something on this slide? I don't think we say anything

        [Enrico Fermi Institute] 14:49:47
on the security model here; we'll talk about the security model later.

        [Enrico Fermi Institute] 14:49:48
Yeah, yeah. So, Andrej, are you connected?

      • 13:20
        European HPC 30m
        Speaker: Andrej Filipcic (Jozef Stefan Institute (SI))

        EURO HPC

         

        [Enrico Fermi Institute] 14:49:55
        Do you want to share? Yeah.

        [Andrej Filipcic] 14:49:56
Let me share my screen. Can you hear me all right?

        [Andrej Filipcic] 14:49:59
Okay, good.

        [Enrico Fermi Institute] 14:50:02
Great. So we want to show a little bit of what's going on on the European side.

        [Andrej Filipcic] 14:50:04
        Just

        [Enrico Fermi Institute] 14:50:08
Yeah, then we can just use this as a...

        [Andrej Filipcic] 14:50:09
        Right? So just a bunch of slides. But let me know if you are interested in anything else.

        [Enrico Fermi Institute] 14:50:12
        Yeah.

        [Andrej Filipcic] 14:50:18
or on some specifics over here. So maybe it's a bit too generic.

        [Andrej Filipcic] 14:50:21
So the EuroHPC Joint Undertaking is, let's say, a company of 31 states, which are called out here on the right side.

        [Andrej Filipcic] 14:50:34
All the members are basically all of Europe, and Turkey, apart from the UK and Switzerland.

        [Andrej Filipcic] 14:50:39
And in the first phase, which ended last year, there were 8 machines funded.

        [Enrico Fermi Institute] 14:50:43
        Okay.

        [Andrej Filipcic] 14:50:46
So 3 pre-exascale machines in the range of 250 to 350 petaflops:

        [Andrej Filipcic] 14:50:51
those are LUMI in Finland, Leonardo, which will be inaugurated in November in Italy, and MareNostrum, which will be a bit later.

        [Andrej Filipcic] 14:51:02
Its procurement just finished, but there are not many details on this machine yet, apart from the talk today; they will have a quite large CPU partition of 30 petaflops,

        [Andrej Filipcic] 14:51:14
which is quite good for us, let's say. The second phase is the 6 years up to 2027, and of the currently approved machines the high-range one is the exascale machine, which will be JUPITER. The machine was just approved, but the procurement was

        [Andrej Filipcic] 14:51:35
not yet done, so no details on this machine, just the plans. Basically they

        [Andrej Filipcic] 14:51:42
want to reach one exaflop with it, okay, that's enough.

        [Andrej Filipcic] 14:51:47
And there will be 4 mid-range ones, so 4 HPCs with investments between 20 and 40 million euros each, and those will be in Greece,

        [Andrej Filipcic] 14:52:02
Hungary, and Ireland, I think. Also there'll be some co-located quantum computers,

        [Andrej Filipcic] 14:52:12
the first generation, and this will be approved probably next month.

        [Andrej Filipcic] 14:52:18
I was skipping ahead. So this is just the mission, which you can read later on.

        [Andrej Filipcic] 14:52:24
Basically, EuroHPC wants to support leadership supercomputing, including quantum computing, and all the data infrastructure around it.

        [Andrej Filipcic] 14:52:35
Then they want to develop their own hardware, and they want to involve industry a lot,

        [Andrej Filipcic] 14:52:42
let's say. So the budget: the budget is split 50% from the European Commission and 50% from the hosting states,

        [Enrico Fermi Institute] 14:52:46
        Okay.

        [Andrej Filipcic] 14:52:55
so these are the countries that decide to build the HPCs,

        [Andrej Filipcic] 14:52:58
although for the smaller machines the European Commission only funds 35%.

        [Andrej Filipcic] 14:53:04
So in phase one, about 1 billion euros were spent; for phase two, 7 to 8 billion is actually foreseen. In the table in the picture you have a detailed breakdown from the European Commission, and then there will be the same matching contribution from

        [Enrico Fermi Institute] 14:53:11
        Okay.

        [Andrej Filipcic] 14:53:25
all the Member States. Okay, and also 200, let's say 200 million, is meant for hyperconnectivity,

        [Andrej Filipcic] 14:53:33
so for a terabit network, and 50% of the money spent on the new network infrastructure.

        [Andrej Filipcic] 14:53:42
There are many projects and R&D activities going on around it.

        [Andrej Filipcic] 14:53:45
Maybe one important one is EuroCC, or the European Competence Centres, which is basically a very large project with around 30 participating

        [Andrej Filipcic] 14:54:01
states, let's say, most of them, and the funding is about 1 million euros per country per year.

        [Andrej Filipcic] 14:54:07
The goals are basically training, connection with the industry, and collecting

        [Andrej Filipcic] 14:54:14
knowledge on HPC, whatever that means. There are also Centres of Excellence, for example, which are mostly dedicated to, let's say, supporting software development or scalability extensions of particular groups.

        [Andrej Filipcic] 14:54:28
They can be dedicated to a particular field of science, like chemistry or molecular dynamics, or something like that, or they can be a bit wider in scope, for specific,

        [Andrej Filipcic] 14:54:38
let's say, data handling for exascale, something like that.

        [Andrej Filipcic] 14:54:44
About 10 Centres of Excellence were initially funded, between 6 to 8 million per project, and those calls will be continuing all the time.

        [Andrej Filipcic] 14:54:53
For this period there are 2 bodies, the Research and Innovation Advisory Group and the Infrastructure Advisory Group, which basically form recommendations for the evolution and development and so forth: basically for everything, for the research calls, for funding, and for infrastructure deployment.

        [Andrej Filipcic] 14:55:16
Another part of it is the European Processor Initiative, with the aim to build a European CPU and GPU;

        [Andrej Filipcic] 14:55:24
of course, maybe it will be slightly RISC-V oriented, more on that later.

        [Andrej Filipcic] 14:55:30
There's also the EuroHPC Master for HPC, which is just a common university program.

        [Andrej Filipcic] 14:55:36
So this is a project that ties together many countries and universities,

        [Andrej Filipcic] 14:55:42
let's say about 30 of them, that will try to put HPC

        [Andrej Filipcic] 14:55:47
studies into a master's program, typically in sync, and share, let's say, students, share lectures, and so on.

        [Andrej Filipcic] 14:55:57
There are about 30 projects altogether. Now, the resource allocation: access is only provided to EU users,

        [Andrej Filipcic] 14:56:07
so basically to members of the European Union, the extended one. Actually, the European Commission share is managed very similarly to PRACE before:

        [Andrej Filipcic] 14:56:19
there are PRACE-like calls for applications, with some changes. The first one is the development and benchmark access,

        [Andrej Filipcic] 14:56:26
with basically immediate access, so let's say within less than a month, maybe even within 2 weeks. And this is not negligible even in resources:

        [Enrico Fermi Institute] 14:56:31
        Okay, this.

        [Andrej Filipcic] 14:56:37
you can get something like up to half a million CPU hours

        [Andrej Filipcic] 14:56:43
for this access, and you get it for up to a year.

        [Andrej Filipcic] 14:56:48
Then the regular access, which is a couple of tens of millions of CPU hours,

        [Andrej Filipcic] 14:56:53
is peer reviewed, and there will also be calls in the future for industry and the public sector.

        [Andrej Filipcic] 14:57:00
This is not yet finalized, because of the funding issues

        [Andrej Filipcic] 14:57:05
and, let's say, charging for the industry. Then the hosting entity share:

        [Andrej Filipcic] 14:57:11
for the owner, the host of the HPC,

        [Andrej Filipcic] 14:57:14
the country, the policies there are completely regulated by country policies or decisions.

        [Andrej Filipcic] 14:57:21
So each state can do whatever they want with their share.

        [Andrej Filipcic] 14:57:28
So, overall, the design of some of the HPCs

        [Andrej Filipcic] 14:57:37
is quite classical, but not all of them are really classical

        [Andrej Filipcic] 14:57:40
HPCs anymore. As you know, Vega in Slovenia was designed with heavy-duty data processing and outbound connectivity strictly in mind, which works actually pretty well

        [Andrej Filipcic] 14:57:54
for ATLAS, where Vega contributed something between 13 and 40% of CPU

        [Andrej Filipcic] 14:58:01
during the last year, let's say. Then the second one, LUMI:

        [Andrej Filipcic] 14:58:03
they have a very large dedicated partition for visualization and services, and they will provide Ceph object storage for long-term data preservation,

        [Andrej Filipcic] 14:58:16
and so on, and they want to provide all the modern tools.

        [Enrico Fermi Institute] 14:58:19
        Okay.

        [Andrej Filipcic] 14:58:20
MareNostrum 5 I mentioned here: it has not been built yet, but they said that it will have a much larger CPU partition and open access, because the government decided that this machine needs to support such usage.

        [Andrej Filipcic] 14:58:35
So this was agreed already. Overall, on the architecture: most of these machines are general purpose,

        [Andrej Filipcic] 14:58:46
some maybe less general purpose than the others, but basically all of them need to adapt to the user needs.

        [Andrej Filipcic] 14:58:54
So it's a bit different: they are not completely free to set the policies

        [Andrej Filipcic] 14:59:00
on how these machines will be set up and what services they can provide, because overall the EuroHPC

        [Andrej Filipcic] 14:59:07
Governing Board, which is representatives from the states, can say what to do with these machines.

        [Andrej Filipcic] 14:59:16
Right. And there are many countries that participate in these calls but don't have an HPC,

        [Andrej Filipcic] 14:59:23
but they would like to use one, basically for all of science.

        [Andrej Filipcic] 14:59:27
And that's also interesting stuff. So, the current machines are a mixture of CPU and GPU partitions.

        [Andrej Filipcic] 14:59:37
The CPUs are mostly AMD, then some Intel; recently, for example, MareNostrum will be Intel.

        [Andrej Filipcic] 14:59:45
Then there is one ARM machine that will be in Portugal, based on Fujitsu. And they have both NVIDIA and AMD, but most have NVIDIA GPUs, and some only have AMD. So LUMI

        [Enrico Fermi Institute] 15:00:03
        Okay.

        [Andrej Filipcic] 15:00:05
is the same as, what's the name, the OLCF one.

        [Andrej Filipcic] 15:00:08
Right. But in any case most nodes have GPUs.

        [Andrej Filipcic] 15:00:14
So most of the hardware is GPUs, comprising between 60 to 80%;

        [Andrej Filipcic] 15:00:20
it depends on the machine. Well, there's one small machine that is CPU-only, but all the big machines have,

        [Andrej Filipcic] 15:00:25
let's say, 20 to 40% of CPU nodes, and not even that much of the CPU

        [Andrej Filipcic] 15:00:31
computing power, right? The storage is typically Lustre, with Ceph, and some also provide some kind of GPFS.

        [Andrej Filipcic] 15:00:40
This one is less popular. And most of these machines, basically, apart from LUMI and Karolina, which is in the Czech Republic, were built by Atos. For the future machines,

        [Andrej Filipcic] 15:00:59
well, most definitely the next large exascale machine, which will be built in France, will be ARM-based, so it would be ARM CPU plus GPU as well.

        [Andrej Filipcic] 15:01:10
Details are not clear yet; the goal is to build it somewhere in 2024-25. And after that, the next one,

        [Andrej Filipcic] 15:01:20
let's say the next exascale machine, whatever it will be:

        [Andrej Filipcic] 15:01:24
they have strong wishes, for now, that it should be RISC-V based. Next slide.

        [Andrej Filipcic] 15:01:33
So, some thoughts and observations after 1.5 years of operations of these machines. Each of them has on the order of something like 500 users, which might seem a bit little for some, but actually most of these users are complete newcomers, since many other users already have allocations on

        [Andrej Filipcic] 15:01:57
the large existing machines, let's say in Italy, Spain, Germany, or France,

        [Andrej Filipcic] 15:02:02
the ones that are part of PRACE. And these users have really a lot of different kinds of workloads: many-node compute jobs, both on CPU and on GPU. The majority, I'll say, is chemistry or material

        [Andrej Filipcic] 15:02:21
science, although, at least on Vega, there are something like 30 different applications that the users want to run. A lot of users also do small-node or small-core parameter scans with tons of independent jobs, let's say. And many, many users in the last

        [Andrej Filipcic] 15:02:43
year started to use machine learning; even ATLAS users do analysis with machine learning, and this is actually rapidly growing,

        [Andrej Filipcic] 15:02:51
because, let's say, it's quite simple with TensorFlow and all these kits

        [Andrej Filipcic] 15:02:55
to allocate these on such a machine. At least on Vega we have a really big pressure on GPUs, so the next machine we buy will have a much larger GPU partition. Some users also do extreme data processing.

        [Enrico Fermi Institute] 15:03:08
        Okay.

        [Andrej Filipcic] 15:03:14
No, I don't mean us here;

        [Andrej Filipcic] 15:03:17
but, for example, something like cryo-microscopy or different stuff, where they produce, let's say, a couple of tens of terabytes per measurement that they want to process, sometimes interactively. Some HPCs allocate full nodes only, but there are many that can run any type

        [Andrej Filipcic] 15:03:32
of jobs. Also, we have observed, in my experience, that many users are not quite happy with the default data organization of the HPCs, which basically more or less doesn't exist,

        [Andrej Filipcic] 15:03:45
I would say, although we will have other tools in the future. But, let's say, within EuroHPC

        [Andrej Filipcic] 15:03:51
the data migration and movement were not yet discussed. And many users stick to containers, and some demand even

        [Andrej Filipcic] 15:04:01
full virtualization. Basically, for EuroHPC, what the user demands to use should basically be provided, sooner or later.

        [Andrej Filipcic] 15:04:11
There are many more users on EuroHPC at this point than there ever were in PRACE.

        [Andrej Filipcic] 15:04:16
So this number will probably grow, cumulatively, to 50,000 pretty soon on all the machines. And there are a lot, really a lot, of newcomers, due to the simplicity of access: you basically just submit a proposal, not even a

        [Andrej Filipcic] 15:04:33
proposal, an application with a quick description, and you will get access within less than a month.

        [Andrej Filipcic] 15:04:39
The usage by the industry is rising a bit;

        [Andrej Filipcic] 15:04:43
this is mostly small or medium enterprises, but it is still not extremely high.

        [Andrej Filipcic] 15:04:50
Let's say, more or less, industry is entitled to use 20% of the HPCs by European law,

        [Andrej Filipcic] 15:04:57
let's say by European funding regulations, but they're not yet at 20,

        [Andrej Filipcic] 15:05:06
far from 20% of usage at this point, although some HPCs, like the one in Luxembourg, were built entirely to support the industry.

        [Andrej Filipcic] 15:05:15
Several countries also decided to provide resources through EuroHPC for LHC computing:

        [Andrej Filipcic] 15:05:20
let's see, Slovenia for sure, because I know what's going on here;

        [Andrej Filipcic] 15:05:25
you have heard the same message from Spain and others;

        [Andrej Filipcic] 15:05:28
and lately even in Germany. I think the Germans want to only keep DESY and KIT; not sure if this is official yet, but other countries will probably follow a similar way.

        [Enrico Fermi Institute] 15:05:36
        question.

        [Andrej Filipcic] 15:05:40
So... and let's see, the European... Sorry, yes, go ahead.

        [Enrico Fermi Institute] 15:05:42
A question:

        [Enrico Fermi Institute] 15:05:47
so you said several countries have already decided, you know, like Slovenia, with the highly successful Vega.

        [Enrico Fermi Institute] 15:05:53
What about the Vega design makes it so much easier to integrate than would be, say, some of the US

        [Enrico Fermi Institute] 15:06:06
snowflakes?

        [Andrej Filipcic] 15:06:09
Well, because on Vega we made the decision to support these services,

        [Andrej Filipcic] 15:06:14
so some other, more classical HPCs are hesitant in this respect.

        [Andrej Filipcic] 15:06:20
        But let's say Vega is not so different in hardware.

        [Andrej Filipcic] 15:06:23
architecture from the others, apart from that it really required a large pipe; at the moment this pipe can do 600 gigabits per second

        [Andrej Filipcic] 15:06:35
to GÉANT. And this will increase in the future.

        [Andrej Filipcic] 15:06:40
So it's mostly a matter of decision, of what you allow users to do over there.

        [Enrico Fermi Institute] 15:06:49
        Okay.

        [Andrej Filipcic] 15:06:51
The network connectivity will likely get a big boost in the next 2 to 3 years, let's say, especially as

        [Andrej Filipcic] 15:07:01
a one-terabit network is foreseen. There are still some open questions about the funding scheme, and who can do the networking, and so on.

        [Andrej Filipcic] 15:07:14
Long-term data storage is not yet part of the plans, so it's a bit in the wild, but there's a high pressure from many communities to use this as well, right?

        [Andrej Filipcic] 15:07:25
So the HPCs at this point are not obliged to provide long-term storage;

        [Andrej Filipcic] 15:07:29
let's say, when the HPC is decommissioned, the storage is really likely to be decommissioned as well, and a new storage will be brought up with the new machine, right?

        [Andrej Filipcic] 15:07:39
But this will need to change in the future. One thing worth stressing is that there are some leadership European projects, like Destination Earth.

        [Andrej Filipcic] 15:07:49
Well, I'm not sure you know it, but Destination Earth basically has as partners

        [Andrej Filipcic] 15:07:54
ECMWF, the weather agency, and EUMETSAT,

        [Andrej Filipcic] 15:07:59
and the aim is to provide a digital twin for Earth, right,

        [Enrico Fermi Institute] 15:07:59
        Hmm.

        [Andrej Filipcic] 15:08:06
which includes satellite imaging, weather data collection,

        [Andrej Filipcic] 15:08:12
weather forecasting, and so on, and basically building a global model for Earth predictions, and so on.

        [Andrej Filipcic] 15:08:23
Basically, it's a huge project. And this organization already officially asked the Joint Undertaking if they could use EuroHPC at the production

        [Andrej Filipcic] 15:08:33
        Level, I'm basically joined the deck and agreed, for now they can use 10% of the All the resources right.

        [Andrej Filipcic] 15:08:43
the European Commission part, up to 10%. And more organizations will follow this way; for example, Destination Earth doesn't have enough funding or money to do anything without EuroHPC

        [Andrej Filipcic] 15:08:57
at this point. So more projects like this will follow, and maybe even... let's see.

        [Andrej Filipcic] 15:09:07
But this was not discussed yet. I will skip the next slides, because they are just a bit of an overview of the computers, as you can see.

        [Enrico Fermi Institute] 15:09:10
        Okay.

        [Andrej Filipcic] 15:09:15
        You will see them later on when I upload them

        [Andrej Filipcic] 15:09:20
        Okay, That's it.

        [Enrico Fermi Institute] 15:09:23
Great, thank you. We have some raised hands. So, Paolo, his hand has been raised for a while, go ahead.

        [Paolo Calafiura (he)] 15:09:30
Yeah, do I remember... oh yeah, I saw one slide in which you mentioned that short term, let's say the next generation, would be ARM, and the next-next generation may be RISC-V. And I'm wondering if you meant for a CPU replacement, and therefore

        [Andrej Filipcic] 15:09:46
        Right.

        [Paolo Calafiura (he)] 15:09:53
also having accelerators. Or are you just saying it will be

        [Paolo Calafiura (he)] 15:09:57
more pure RISC-V?

        [Andrej Filipcic] 15:09:58
No, ARM will have accelerators.

        [Paolo Calafiura (he)] 15:10:00
Okay, okay. So, something like...

        [Andrej Filipcic] 15:10:05
Yeah, something like that. It's not clear yet whether it will be Grace Hopper style, or separate chips, or whatever.

        [Paolo Calafiura (he)] 15:10:12
        Okay, okay.

        [Enrico Fermi Institute] 15:10:16
Okay, there's a hand up. Ian?

        [Ian Fisk] 15:10:18
Yeah, my questions were actually two. One was the Destination Earth project:

        [Ian Fisk] 15:10:24
is that a strategic alliance between EuroHPC

        [Ian Fisk] 15:10:26
and the project? And is it multi-year? Is it different from the typical peer review?

        [Ian Fisk] 15:10:31
        Oh!

        [Andrej Filipcic] 15:10:31
Yes, it's completely different, because this is a long-term project, for at least 10 years, and even more.

        [Andrej Filipcic] 15:10:41
        Let me see, let's say.

        [Ian Fisk] 15:10:42
Okay, but does that mean the door is open to other multi-year things like that?

        [Ian Fisk] 15:10:50
WLCG, LHC, negotiating such an arrangement?

        [Andrej Filipcic] 15:10:53
I think so. I mean, the thing is that the European Commission needs to find such projects of interest to support them. And actually those projects are typically listed in the ESFRI table, where, for example, High Luminosity is, right?

        [Ian Fisk] 15:11:10
Okay. Was there... I may have missed it,

        [simonecampana] 15:11:13
        I think

        [Ian Fisk] 15:11:17
but is there a second exascale machine in France someplace?

        [Ian Fisk] 15:11:22
I thought the only one was in Germany. Just to understand.

        [Andrej Filipcic] 15:11:23
The official one that was accepted already, so that's going into procurement, is in Germany, JUPITER; France will likely come next year.

        [Enrico Fermi Institute] 15:11:30
        Okay.

        [Ian Fisk] 15:11:32
        Okay, nice okay.

        [Andrej Filipcic] 15:11:33
I mean, the call for proposals...

        [Ian Fisk] 15:11:39
        Thanks.

        [Enrico Fermi Institute] 15:11:41
And now from Maria.

        [Maria Girone] 15:11:43
Maybe I just want to say that, for a research infrastructure like ours, there are a lot of ongoing discussions,

        [Maria Girone] 15:12:01
as Simone knows well, between, let's say, the larger communities and EuroHPC,

        [Maria Girone] 15:12:09
in order to try to motivate further collaborations, very much like programs like Destination Earth, which indeed is a priority for the European Commission.

        [Maria Girone] 15:12:21
But we also have a number of projects now that will allow us to do

        [Maria Girone] 15:12:28
R&D. Some of this, I think, we will, for instance,

        [Maria Girone] 15:12:35
present tomorrow, indeed: what we're doing with the Jülich supercomputing centre for what concerns the development and use of GPU resources at scale for distributed training. There is in the pipeline also a European project that will allow us to evaluate open-source

        [Maria Girone] 15:13:00
solutions like RISC-V and SYCL. So there are a number of opportunities and, as was said very well, it is actually very easy

        [Maria Girone] 15:13:09
to work on the development side with EuroHPC,

        [Maria Girone] 15:13:13
and we get granted the resources for developers

        [Maria Girone] 15:13:20
even within 5 days, I mean, within a working week. So it's a very, very nice collaboration, at least at this level. We need to build on this and go further, and that is less obvious

        [Maria Girone] 15:13:33
and will require some common actions, let's say, at least when we are talking to EuroHPC.

        [Enrico Fermi Institute] 15:13:47
Simone was first. Okay.

        [simonecampana] 15:13:49
It's a follow-up on Ian's question. I think one of the requisites for entering one of those special programs, like, I don't know,

        [simonecampana] 15:14:02
I don't remember how they're called, but the non-grant-based, the more long-term ones,

        [simonecampana] 15:14:06
is first that you are an impactful science. And of course, you know, it's not trivial to define what is impactful,

        [simonecampana] 15:14:17
but of course the one who saves the planet's health has a simpler way of demonstrating that impact. Also, if we want to apply for something like this, I think it's important to make a lot of progress in the software area,

        [simonecampana] 15:14:32
because one of the other things one has to demonstrate is that you use an HPC

        [simonecampana] 15:14:38
for the value of an HPC, and already we don't use much of the interconnects that an HPC offers.

        [simonecampana] 15:14:45
So if we are also cheap on GPUs and the use of those architectures, then we become not such a great candidate for one of those programs.

        [simonecampana] 15:14:55
So I think we have to build our story, and we have some technical improvements at the software level that would allow us to build a better story. And my question is if there is something like this in the US,

        [simonecampana] 15:15:10
because if there was, then we could try to build an even more coherent story across

        [simonecampana] 15:15:16
Europe and the US.

        [simonecampana] 15:15:22
Do you have a notion of sciences that get into a program once, and then there is a multi-year engagement with the HPC facilities?

        [Ian Fisk] 15:15:31
Oh, I think Dirk might be able to answer better, but I think this is one of the things in the US:

        [Ian Fisk] 15:15:37
there is definitely a push these days from the US

        [Ian Fisk] 15:15:40
funding agencies for science to make effective use of the large-scale computing facilities.

        [Ian Fisk] 15:15:48
And so, there's not a program like EuroHPC,

        [Ian Fisk] 15:15:53
because it's only one country, but it does mean that, if you look at where the national labs have made investments,

        [Ian Fisk] 15:16:03
a lot of the investments have been made in central facilities, with the expectation that the calculations are done

        [Ian Fisk] 15:16:07
        There.

        [simonecampana] 15:16:08
        Right.

        [Enrico Fermi Institute] 15:16:09
Yeah, the thing is, I mean, at the moment there are very high-level discussions going on, because there's the push from the funding agencies that we should use more HPC. And it's not just us, it's in general: because they pay for these facilities, they want them

        [Enrico Fermi Institute] 15:16:25
to be used. But there is now a pushback, and that's where the conversation is, at a very high level.

        [Ian Fisk] 15:16:27
        Good.

        [Enrico Fermi Institute] 15:16:32
Liz mentioned something, that there are groups talking about what they need to do in terms of changing their policies to actually allow that. Because the application process for the DD program, the ALCC application process, the INCITE application process, is just not geared towards these use cases; they want competitive proposals that are unique, that can only be done there, which

        [Enrico Fermi Institute] 15:16:56
is just not a good match. And this is above our pay scale

        [Enrico Fermi Institute] 15:17:01
here. These conversations are going on; hopefully something comes out of it, we'll see.

        [Taylor Childers] 15:17:05
I'm not sure that that is the case. I mean...

        [Taylor Childers] 15:17:10
The INCITE program offers the opportunity to get up to 3 years of allocation through a competitive review process.

        [Taylor Childers] 15:17:19
The challenge is laying out your case, and I would argue that the way you approach this for the leadership computing facilities is you have to play to their mission, right?

        [Taylor Childers] 15:17:34
I mean, their mission is to provide the biggest computers because people need them, not because...

        [Enrico Fermi Institute] 15:17:40
But, Taylor, this means you basically have to sell it, and you have to sell it in a way that you basically dress it up as something that

        [Enrico Fermi Institute] 15:17:48
can only be done there, and that's not what we want.

        [Taylor Childers] 15:17:52
I agree, but I would argue that you can easily make the case based on the fact that you are resource-constrained, and if you don't get access to the machines, then you'll be slower in your science achievements, and I

        [Enrico Fermi Institute] 15:17:52
        Thanks.

        [Taylor Childers] 15:18:14
think that's a viable argument. I think the part where you guys have trouble, especially in an INCITE program proposal, is the fact that you don't have enough of the workloads that take advantage of GPUs, right?

        [Enrico Fermi Institute] 15:18:31
Yeah, that's probably why we're trying ALCC right now.

        [Taylor Childers] 15:18:31
I mean, the challenge is...

        [Taylor Childers] 15:18:35
        Yeah, for sure and

        [Enrico Fermi Institute] 15:18:35
That's easier to justify, I think. If we ever get to the point that you could say, okay, if we get like a huge INCITE proposal... you can make the science use case if you can do something you couldn't otherwise do, because it basically adds 50% of your own capacity, or

        [Enrico Fermi Institute] 15:18:50
whatever. But then the other bit kicks in. INCITE is still only:

        [Enrico Fermi Institute] 15:18:55
you do an allocation proposal, and you get the decision, and then you get it like 3 months later, a few months later. It's too short a time scale.

        [Enrico Fermi Institute] 15:19:05
You would basically have to ask a year or 2 in advance to fit our planning process within the experiment, because you can't just drop that on top of CMS

        [Enrico Fermi Institute] 15:19:14
and expect that we basically throw our plans out the window

        [Enrico Fermi Institute] 15:19:18
and now effectively use it.

        [Taylor Childers] 15:19:19
Yeah, no, there definitely need to be more discussions above our pay grade.

        [Taylor Childers] 15:19:25
I mean, the challenge there is, to some extent you have to change how the leadership computing facilities are reviewed, so that we can accommodate stuff like that.

        [Enrico Fermi Institute] 15:19:44
So, I'm sorry, you mentioned that obviously these machines are a mixed CPU-GPU.

        [Enrico Fermi Institute] 15:19:55
Next generation, will more of the flops and actual compute power be in the accelerator realm, or will there be some machines where ARM sort of provides the heavy lifting?

        [Andrej Filipcic] 15:20:14
Well, how to say, it's hard to predict. But, in my opinion, there will always be machines built in that way, such that many user communities can use them.

        [Andrej Filipcic] 15:20:24
And several sites know that already, right? So nobody will go to a completely dedicated machine. For example, even JUPITER, which is exascale: it would be easier to build it up and reach the highest Top500 number by going GPU-only, right, but they don't

        [Andrej Filipcic] 15:20:45
want that; I mean, nobody would actually want that. Now, on CPUs,

        [Andrej Filipcic] 15:20:53
it depends, right. But still, quite many users are used to x86, right. So

        [Andrej Filipcic] 15:21:00
ARM is not so difficult in that respect, if you use the CPU-only part; when you have GPUs it will be slightly different. But ARM will definitely be a larger player in the next couple of years, something like that.

[Enrico Fermi Institute] 15:21:14
But my takeaway from what you've just said is that at least the next generation is likely to have a significant CPU footprint, because they're sort of mandated to be as usable as possible to the broader scientific and similar communities.

        [Andrej Filipcic] 15:21:35
Right, yup.

        [Enrico Fermi Institute] 15:21:36
Right, okay, thanks. Okay, let's move on, unless there are any other questions for Andrej.

        [Enrico Fermi Institute] 15:21:48
I think we should move on. Okay, thank you.

        [Andrej Filipcic] 15:21:50
You're welcome!

         

        M100

        [Enrico Fermi Institute] 15:21:52
Okay, we have a couple of slides from some European CMS

        [Enrico Fermi Institute] 15:21:57
efforts. Daniele isn't connected, I think; unless he's here, then you should speak up.

        [Enrico Fermi Institute] 15:22:04
He told me he couldn't. So, this

        [Enrico Fermi Institute] 15:22:08
is the integration basically at CINECA, at the CNAF Tier-1.

        [Enrico Fermi Institute] 15:22:13
So they are co-located in the same data center.

        [Enrico Fermi Institute] 15:22:18
There is the CINECA Marconi100 HPC,

        [Enrico Fermi Institute] 15:22:21
which is basically a clone, in terms of system architecture, of Summit.

        [Enrico Fermi Institute] 15:22:26
So it's POWER plus NVIDIA, and they integrated it as a sub-site of the Tier-1.

        [Enrico Fermi Institute] 15:22:32
Since they're co-located in the same data center, they have a really fast network interconnect

        [Enrico Fermi Institute] 15:22:37
to tie it together. The HPC can basically see the Tier-

        [Enrico Fermi Institute] 15:22:42
one storage system; the services are provided by the data center, and they run it as a sub-site of the Tier-1.

        [Enrico Fermi Institute] 15:22:52
So CMS operations only sees the Tier-1, and then, internally, via some pilot customizations, they can select which parts of the workflows that are sent to the Tier-1 can run on the HPC

        [Enrico Fermi Institute] 15:23:05
side. And basically, where we are today: the slide says almost complete.

        [Enrico Fermi Institute] 15:23:09
        I think it is complete now, because the announcement came out after the slide was sent to me.

        [Enrico Fermi Institute] 15:23:15
You see some slides of how it's integrated, and so on.

        [Enrico Fermi Institute] 15:23:18
You see it in the monitoring: the sub-site concept has some unique challenges in how you monitor it.

        [Enrico Fermi Institute] 15:23:27
        Good.

        [Paolo Calafiura (he)] 15:23:27
I'm sorry, are we looking at slides? Because we don't see these slides.

        [Enrico Fermi Institute] 15:23:31
Ah, I forgot to re-share it. There it is. Yeah, let me bring it back up. Because in the U

        [Paolo Calafiura (he)] 15:23:33
        Alright.

        [Enrico Fermi Institute] 15:23:41
S, for all the HPC sites we're using,

        [Enrico Fermi Institute] 15:23:44
we basically put the concept of a T3 grid site on top of it, which makes the monitoring and accounting and so on really easy, because everything is reported under a unique site. If you have a sub-site, it is a little bit more difficult, because everything is kind of hidden under the umbrella of

        [Enrico Fermi Institute] 15:24:02
the Tier-1, and then you have to kind of dig into some subfields to identify it. There has been some work ongoing in the monitoring, on the CMS monitoring side, to make that easier.

        [Enrico Fermi Institute] 15:24:16
Doesn't this model make it easier to accommodate? It makes perfect sense, I mean, for them.

        [Enrico Fermi Institute] 15:24:25
It's great because, I mean, they're co-located anyway.

        [Enrico Fermi Institute] 15:24:29
It makes perfect sense for them. It's a bit more difficult if you're geographically and organizationally separate entities.

        [Enrico Fermi Institute] 15:24:40
So in the US it's kind of difficult, because the HPCs are usually standalone. So what is it that has changed between a few years ago and now

        [Enrico Fermi Institute] 15:24:51
with regard to CVMFS? It seems like initially people were very, very wary of it: "you don't want to put this on our HPC,

        [Enrico Fermi Institute] 15:24:58
because it'll crash everything", or whatever. I mean, is it that the technology has gotten better?

        [Enrico Fermi Institute] 15:25:02
Or is it that people have gotten less afraid of it? Maybe people have gotten less afraid of it as they just became familiar with it. Also, a lot of people are using it, not just us; that

        [Enrico Fermi Institute] 15:25:12
helps. And I don't worry about it anymore, because on any recent machine with a recent OS there is no problem running CVMFS access.

        [Enrico Fermi Institute] 15:25:24
I just built my own. And, I mean, from the OSG side, when we bring on new sites, we only use

        [Enrico Fermi Institute] 15:25:31
cvmfsexec, and only if they directly ask "can I please run

        [Enrico Fermi Institute] 15:25:37
CVMFS", or if they have any other problems, do we give them the option. Why have that conversation if somebody's not attached to it?

        [Enrico Fermi Institute] 15:25:53
Okay, even on the LCFs, no problem: it worked on Theta out of the box.

        [Enrico Fermi Institute] 15:25:58
Basically it worked on Summit out of the box, so I didn't see any issues.

        [Enrico Fermi Institute] 15:26:01
I have to go configure the squid there so I can actually run it on the batch nodes,

        [Enrico Fermi Institute] 15:26:05
but it worked on the login node, which runs the same operating system.


        BSC update

        [Enrico Fermi Institute] 15:26:10
And then, Antonio, are you connected? Okay. So Antonio can say a few words on what we're doing at MareNostrum.

        [Antonio Perez-Calero Yzquierdo] 15:26:12
        Hi! Yes, I am! Can you hear me?

        [Antonio Perez-Calero Yzquierdo] 15:26:17
Yeah, okay. So, MareNostrum 4 is the current supercomputer at BSC,

        [Antonio Perez-Calero Yzquierdo] 15:26:26
and this is the largest HPC centre in Spain.

        [Antonio Perez-Calero Yzquierdo] 15:26:29
MareNostrum 5, as explained before, is actually in the procurement phase. We are accessing BSC and MareNostrum as a project mediated by PIC, that's the WLCG

        [Antonio Perez-Calero Yzquierdo] 15:26:48
Spanish Tier-1. And fortunately, let's say interestingly, LHC

        [Antonio Perez-Calero Yzquierdo] 15:26:54
computing has been designated as a strategic project in the BSC program.

        [Antonio Perez-Calero Yzquierdo] 15:26:58
So this means that, well, we still have to request the allocation,

        [Antonio Perez-Calero Yzquierdo] 15:27:02
but we are getting quarterly grants of about 6 or 7 million hours,

        [Antonio Perez-Calero Yzquierdo] 15:27:09
available for CMS, and I think it's about the same amount for ATLAS.

        [Antonio Perez-Calero Yzquierdo] 15:27:16
So we are getting these allocations, let's say, regularly.

        [Antonio Perez-Calero Yzquierdo] 15:27:19
Okay. However, the case is very difficult for CMS:

        [Antonio Perez-Calero Yzquierdo] 15:27:24
the environment is extremely challenging because, well, for security reasons, no incoming or outgoing connectivity is allowed on the compute nodes.

        [Antonio Perez-Calero Yzquierdo] 15:27:36
This means that everything that needs to happen for CMS to run a job, like what I have here on the right-hand side, being connected to the workload management, being able to access the software, of course the conditions data, and finally access to storage, all these things

        [Antonio Perez-Calero Yzquierdo] 15:27:54
        are cut; basically, all this connectivity is cut. Even... we have recently been discussing the possibility of having some edge services,

        [Antonio Perez-Calero Yzquierdo] 15:28:07
        and not even this is allowed.

        [Antonio Perez-Calero Yzquierdo] 15:28:11
        So of course that's a showstopper for CMS, as tasks require

        [Antonio Perez-Calero Yzquierdo] 15:28:15
        services such as the ones I mentioned. That is correct.

        [Antonio Perez-Calero Yzquierdo] 15:28:20
        What we have is a login node, which allows outside access, and a shared file system mounted on the execute nodes.

        [Antonio Perez-Calero Yzquierdo] 15:28:29
        And yeah, we can access this distributed file system. So what we are doing, well, we use these capabilities to build the model that you can see on the next slide, which requires a substantial amount of integration

        [Antonio Perez-Calero Yzquierdo] 15:28:49
        work. Yeah. So the components that we have, let's say, in our favor to make this thing work: first of all, there's the HTCondor split starter.

        [Antonio Perez-Calero Yzquierdo] 15:28:57
        It uses the shared file system as a communication layer for the job

        [Enrico Fermi Institute] 15:29:02
        Yes.

        [Antonio Perez-Calero Yzquierdo] 15:29:02
        management. You can see steps A, B, C, and D

        [Antonio Perez-Calero Yzquierdo] 15:29:10
        in the diagram below, where basically Condor is, kind of, communicating between the startd and the actual starter, where the

        [Antonio Perez-Calero Yzquierdo] 15:29:19
        job runs, let's say, by passing files through the file system.
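
        [Editor's note] A minimal sketch of the file-passing idea behind the split starter described above. It is purely illustrative: the real HTCondor split starter is more involved, and all paths, file names, and message formats below are hypothetical.

        ```python
        # Illustrative only: job control via a shared file system, in the spirit
        # of the HTCondor split starter used at BSC. The outside half writes
        # request files; the inside half, on nodes with no network, polls for them.
        import json
        import time
        from pathlib import Path

        SHARED = Path("/gpfs/shared/htcondor-bridge")  # hypothetical shared FS mount

        def submit_request(job_id: str, payload: dict) -> None:
            """Outside half: drop a job description into the shared directory."""
            req = SHARED / f"{job_id}.request.json"
            tmp = req.with_suffix(".tmp")
            tmp.write_text(json.dumps({"id": job_id, **payload}))
            tmp.rename(req)  # atomic rename, so the poller never sees partial files

        def poll_and_run() -> None:
            """Inside half: poll the shared directory, run jobs, report via files."""
            while True:
                for req in SHARED.glob("*.request.json"):
                    job = json.loads(req.read_text())
                    # ... launch the payload locally; no outbound network needed ...
                    status = SHARED / req.name.replace(".request.", ".status.")
                    status.write_text(json.dumps({"id": job["id"], "state": "running"}))
                    req.unlink()
                time.sleep(30)  # polling interval, tuned in a real deployment
        ```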

        [Antonio Perez-Calero Yzquierdo] 15:29:23
        Okay. Then for software, what we do is basically replicate the CVMFS repositories at BSC: we get what we need at PIC,

        [Antonio Perez-Calero Yzquierdo] 15:29:34
        and then basically send the files and rebuild the environment at BSC.

        [Antonio Perez-Calero Yzquierdo] 15:29:40
        For the conditions data: we cannot access remote databases.

        [Antonio Perez-Calero Yzquierdo] 15:29:46
        We have to prefetch those conditions, make them into files, and pre-place them into BSC.
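
        [Editor's note] A rough illustration of the prefetch-and-ship pattern just described. The real CMS tooling resolves conditions through its own database machinery; the endpoint, tag names, and file layout below are hypothetical placeholders.

        ```python
        # Illustrative only: prefetch conditions payloads into plain files at a
        # connected site (e.g. PIC) so they can be shipped to the disconnected HPC.
        import json
        import urllib.request
        from pathlib import Path

        CONDITIONS_ENDPOINT = "https://conditions.example.org/payload"  # hypothetical
        OUT = Path("conditions_snapshot")

        def prefetch(tags: list) -> None:
            OUT.mkdir(exist_ok=True)
            for tag in tags:
                with urllib.request.urlopen(f"{CONDITIONS_ENDPOINT}?tag={tag}") as resp:
                    (OUT / f"{tag}.db").write_bytes(resp.read())
            # A small manifest lets jobs on the HPC side resolve tags to local files.
            (OUT / "manifest.json").write_text(
                json.dumps({tag: f"{tag}.db" for tag in tags}, indent=2)
            )

        # The snapshot directory is then copied to the HPC shared file system, and
        # jobs are configured to read conditions from these files instead of a DB.
        ```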

        [Antonio Perez-Calero Yzquierdo] 15:29:51
        And finally, for storage concerns, we have developed our own service for input and output data transfers, initially for output, for the stage-out, let's say. Now we are also commissioning this for the stage-in.

        [Antonio Perez-Calero Yzquierdo] 15:30:08
        So it's kind of quite convoluted,

        [Antonio Perez-Calero Yzquierdo] 15:30:12
        the system. You can see, on the two extremes of the diagram, CERN, of course, with the CMS workload management system, the storage, etc.,

        [Antonio Perez-Calero Yzquierdo] 15:30:20
        and, on the other hand, BSC; and we have had to build all this intermediate layer at

        [Antonio Perez-Calero Yzquierdo] 15:30:27
        PIC, this bridge. Okay, next, please. Yeah. So, what's the current status?
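
        [Editor's note] A stripped-down sketch of the two-hop bridge just described: pull a file to the connected Tier-1, then push it over SSH to the HPC shared file system. Hosts and paths are hypothetical, and the real PIC service adds queuing, retries, and bookkeeping.

        ```python
        # Illustrative only: two-hop stage-in, grid storage -> PIC buffer -> BSC
        # shared FS, using standard CLIs. Hosts and paths are hypothetical.
        import subprocess

        def stage_in(lfn: str) -> None:
            buffer_path = f"/buffer/{lfn.split('/')[-1]}"  # hypothetical PIC buffer
            # Hop 1: fetch from grid storage (e.g. an XRootD endpoint) to PIC.
            subprocess.run(
                ["xrdcp", f"root://xrootd.example.org/{lfn}", buffer_path],
                check=True,
            )
            # Hop 2: push from PIC to the HPC login node over SSH; compute nodes
            # then read the file from the shared file system.
            subprocess.run(
                ["rsync", "-a", buffer_path,
                 "bsc-login.example:/gpfs/projects/cms/input/"],
                check=True,
            )

        # Stage-out is the same pipeline in reverse: job output lands on the shared
        # FS, is pulled back to the Tier-1, and is injected into grid storage there.
        ```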

        [Antonio Perez-Calero Yzquierdo] 15:30:35
        Okay. The system works; the services and infrastructure that we have deployed

        [Antonio Perez-Calero Yzquierdo] 15:30:41
        have already allowed us to run tests at a very reasonable scale:

        [Antonio Perez-Calero Yzquierdo] 15:30:47
        15,000 CPU cores at MareNostrum,

        [Antonio Perez-Calero Yzquierdo] 15:30:51
        with realistic CMS jobs, and this is an aggregated,

        [Antonio Perez-Calero Yzquierdo] 15:30:56
        well, an aggregate output rate of 500 megabytes per second.

        [Antonio Perez-Calero Yzquierdo] 15:31:00
        Okay? So it's capable of sustaining such rates.

        [Enrico Fermi Institute] 15:31:03
        Yeah.

        [Antonio Perez-Calero Yzquierdo] 15:31:03
        So the staging out works; it's commissioned and ready, as I'm mentioning.

        [Antonio Perez-Calero Yzquierdo] 15:31:12
        The problem, let's say now, okay, is actually in discussing the CMS workloads that can fit into this model,

        [Antonio Perez-Calero Yzquierdo] 15:31:21
        with the constraints that I explained before. So, what I would call realistic CMS workloads: so far, these tests are GEN-SIM task chain jobs.

        [Antonio Perez-Calero Yzquierdo] 15:31:32
        For example, in this case, minimum bias production. So it means there is no access

        [Antonio Perez-Calero Yzquierdo] 15:31:38
        needed, okay; there is no input data. A full simulation, however, in the style that CMS mostly performs, is in the form of a step chain.

        [Antonio Perez-Calero Yzquierdo] 15:31:49
        So it's a single Condor job running all the four stages, GEN, SIM, DIGI, RECO, where in the last two stages the pileup libraries are accessed via

        [Antonio Perez-Calero Yzquierdo] 15:32:04
        triple-A. Okay, since we cannot have triple-A, what we could do in order to be able to run this full step chain is to copy the premixed data samples into BSC.

        [Antonio Perez-Calero Yzquierdo] 15:32:15
        We have, let's say, asked about this possibility.

        [Antonio Perez-Calero Yzquierdo] 15:32:19
        But, okay, copying data sets of a size of about a petabyte

        [Antonio Perez-Calero Yzquierdo] 15:32:28
        is not currently allowed; there's no capacity in the current MareNostrum for that. Perhaps on the MareNostrum

        [Antonio Perez-Calero Yzquierdo] 15:32:35
        5 timescale, but not at present. Okay? So that rules out this type of full simulation

        [Antonio Perez-Calero Yzquierdo] 15:32:41
        workload for now, let's say. And then, what we are doing right now is commissioning the stage-

        [Antonio Perez-Calero Yzquierdo] 15:32:48
        in, right? So, this customized data transfer service, in order to push files from PIC storage, for example, or even data we could get through triple-A into PIC and then from there into BSC,

        [Antonio Perez-Calero Yzquierdo] 15:33:02
        in order to enable running workflows which require input data.

        [Antonio Perez-Calero Yzquierdo] 15:33:05
        For example, we are thinking of participating in, or enabling, data reprocessing there.

        [Antonio Perez-Calero Yzquierdo] 15:33:14
        And this is the current situation. It's not only, okay,

        [Antonio Perez-Calero Yzquierdo] 15:33:19
        Let's say, in relation to many things that have been discussed so far.

        [Antonio Perez-Calero Yzquierdo] 15:33:23
        yeah, in this workshop. It's not only the capabilities that we are allowed, or actually not allowed, to have at BSC; it's that, together with how CMS operates, for example, step chains are preferred over task chains, right? So this already restricts

        [Antonio Perez-Calero Yzquierdo] 15:33:45
        very much what we can do at BSC. I think that's it.

        [Enrico Fermi Institute] 15:33:53
        I just wanted to make a comment.

        [Enrico Fermi Institute] 15:33:55
        I just wanted to make a comment. Antonio showed the split starter method, this HTCondor integration.

        [Enrico Fermi Institute] 15:34:01
        That's actually what we used for the LCF

        [Enrico Fermi Institute] 15:34:06
        Theta integration, the prototype integration that we used during 2020-2021.

        [Antonio Perez-Calero Yzquierdo] 15:34:10
        Yeah.

        [Enrico Fermi Institute] 15:34:14
        It worked there, too. It's even a little simpler there, because you do have edge services that you can call out from.

        [Enrico Fermi Institute] 15:34:22
        So certain things are not quite as complicated as at BSC.

        [Enrico Fermi Institute] 15:34:25
        But we followed the same general integration principle.

        [Antonio Perez-Calero Yzquierdo] 15:34:28
        Yeah. Our case, I don't know, I would say it's particularly interesting, because we are really being asked, and forced, right?

        [Antonio Perez-Calero Yzquierdo] 15:34:37
        We have been asked, forced, to use MareNostrum, from the funding agency point of view.

        [Antonio Perez-Calero Yzquierdo] 15:34:44
        Right? I mean, yeah, we have the notion that CPU is going to be cut in further incoming requests, let's say, funding requests for our LHC computing projects. But, on the other hand, BSC is not very friendly in terms of allowing things that would

        [Antonio Perez-Calero Yzquierdo] 15:35:06
        make the integration easier.

        [Enrico Fermi Institute] 15:35:07
        And the funding agency has no way to influence BSC?

        [Enrico Fermi Institute] 15:35:11
        They can just say No, we don't

        [Antonio Perez-Calero Yzquierdo] 15:35:13
        Yeah. It's kind of, I don't know...

        [Antonio Perez-Calero Yzquierdo] 15:35:15
        I see it as kind of paradoxical, because really we're kind of trapped between the two forces squeezing us in the middle.

        [Antonio Perez-Calero Yzquierdo] 15:35:23
        Right? So yeah, it's making it quite a lengthy and arduous project to integrate this.

        [Antonio Perez-Calero Yzquierdo] 15:35:32
        Well, we are advancing. We are trying actually to make it as universal as possible.

        [Antonio Perez-Calero Yzquierdo] 15:35:37
        let's say, in relation to CMS workflows, because otherwise we would not,

        [Antonio Perez-Calero Yzquierdo] 15:35:44
        we will not be able to use the resource. But again, it's difficult.

        [Enrico Fermi Institute] 15:35:50
        Okay, Any other questions, comments.

        [Ian Fisk] 15:35:56
        I had one, which is sort of to Antonio, and sort of, I think, to the larger group, which is: do we

        [Ian Fisk] 15:36:04
        want to take advantage of sort of the WLCG

        [Enrico Fermi Institute] 15:36:06
        It's

        [Ian Fisk] 15:36:09
        and the sort of larger organizational structures that we have, to basically say that network connectivity on the site is necessary to work?

        [Ian Fisk] 15:36:21
        I think it's really very impressive technical work to be able to get around this.

        [Ian Fisk] 15:36:25
        But this is something that we could sort of... I wonder if there'd be any benefit to sort of pushing from above.

        [Antonio Perez-Calero Yzquierdo] 15:36:34
        Yeah, I'm not usually involved in the political discussions,

        [Antonio Perez-Calero Yzquierdo] 15:36:40
        so I couldn't tell, myself. I don't know if Simone, for example, with the... we provide...

        [Enrico Fermi Institute] 15:36:47
        I mean, the one thing, Antonio: you said that they want to reduce your funding for grid computing and replace it with HPC.

        [Enrico Fermi Institute] 15:36:55
        I mean, at that point, I think, they expect that the HPC allocation capacity kind of counts as a replacement, and don't they need

        [Antonio Perez-Calero Yzquierdo] 15:36:55
        Yeah.

        [Enrico Fermi Institute] 15:37:07
        like a WLCG agreement at that point, that they actually consider this to be an equivalent replacement?

        [Antonio Perez-Calero Yzquierdo] 15:37:13
        Yeah. In principle, the idea is that for CPU-intensive workloads, estimated at about 50% of the CPU requirement

        [Antonio Perez-Calero Yzquierdo] 15:37:24
        of our request, 50% would be provided by, yeah, by BSC.

        [Antonio Perez-Calero Yzquierdo] 15:37:30
        And then we still would have some Cpu for data processing.

        [Antonio Perez-Calero Yzquierdo] 15:37:34
        Let's say, for the usual Tier-1 work. That's kind of the idea.

        [Antonio Perez-Calero Yzquierdo] 15:37:40
        But in order to do that, yeah, like I said, we are being forced, kind of, to transform this into a universal resource, which it very much is not.

        [simonecampana] 15:37:52
        Yeah, two comments. Several people talked to the funding agency, including myself, and to PIC, and also to BSC.

        [Enrico Fermi Institute] 15:38:02
        Yeah.

        [simonecampana] 15:38:05
        But it seems to be a triangle where they don't really understand each other.

        [simonecampana] 15:38:11
        So I think what Antonio is saying is correct. They're trying to push this down

        [simonecampana] 15:38:17
        our throat, and of course we are trying to push back. Now,

        [simonecampana] 15:38:21
        of course, the funding agency is not obliged to pledge, right?

        [simonecampana] 15:38:27
        I mean, the funding agency just says: okay, this is the money we have, and, you know, if you want X, you have to be okay with using this.

        [Ian Fisk] 15:38:35
        Okay, I guess Simone made that point. And my point was sort of like:

        [Ian Fisk] 15:38:40
        did we want to? When we were writing the MoU we set relatively strict criteria about what services needed to be run, and what the expectations were in terms of quality of service and availability, but also in the development of the protocols. And this occurs to me as a place where

        [Ian Fisk] 15:38:58
        the WLCG could decide that one of the protocols that's necessary to be considered a site is this.

        [Ian Fisk] 15:39:05
        And it doesn't. It's not guaranteed to work.

        [Ian Fisk] 15:39:07
        But I think that, in some sense, accepting the site without it almost guarantees that it will not change.

        [simonecampana] 15:39:14
        Yeah, I mean, it would be useful if, I would say, the PIC management would make a formal request on this to WLCG, because in reality what PIC has done is a lot of diligent work to try to overcome the limitations.

        [Ian Fisk] 15:39:32
        Right.

        [simonecampana] 15:39:34
        It would be good if it were the other way around, and at some point they would say: we cannot do this.

        [simonecampana] 15:39:41
        We cannot offer Tier-1 services with this piece of the facility, and then we would have a discussion with the funding agency on that basis. At the moment those discussions have not led to too much, to be honest. I don't know if Antonio has more detail; that's what I understand also from

        [Enrico Fermi Institute] 15:39:54
        Okay.

        [simonecampana] 15:39:58
        PIC.

        [Enrico Fermi Institute] 15:39:59
        Could we try to move on, and maybe take that offline? Because, yeah, it's interesting, but it's also internal to WLCG and the Spanish funding agency.

        [Antonio Perez-Calero Yzquierdo] 15:40:10
        yeah, Thank you.

        [Enrico Fermi Institute] 15:40:11
        So that's not relevant to the charge. I think we have one more presentation, and then we still need to have the cost discussion.

      • 13:50
        ANL update 20m

        ANL update

        Speaker: Taylor Childers (Argonne National Laboratory (US))

        ANL slides


        [Enrico Fermi Institute] 15:40:16
        Yeah. So we're running a little late. Yes, yeah, let's move on. Taylor,

        [Enrico Fermi Institute] 15:40:24
        do you have slides for us?

        [Taylor Childers] 15:40:28
        Yeah, I have a few slides

        [Enrico Fermi Institute] 15:40:30
        Okay, great.

        [Taylor Childers] 15:40:36
        Hey!

        [Enrico Fermi Institute] 15:40:40
        Right.

        [Taylor Childers] 15:40:41
        So, hi. This is a disclaimer, a disclaimer to make sure I don't do anything silly, but, you know, the point is: this is my own outlook

        [Taylor Childers] 15:40:52
        on the future. I'm not presenting any inside information. Yeah, I don't even know what's coming after

        [Taylor Childers] 15:41:00
        Aurora. There are people at Argonne that do, but not me.

        [Enrico Fermi Institute] 15:41:06
        But Aurora is still coming, right? That's...

        [Taylor Childers] 15:41:08
        Yeah, if anything is real, it's that Aurora is still coming. That's been the case for far too long.

        [Enrico Fermi Institute] 15:41:21
        still coming.

        [Taylor Childers] 15:41:22
        Yeah, it's still coming. Okay. So, I went back and updated this plot from a long time ago to provide a quick update on where things are in the US.

        [Taylor Childers] 15:41:38
        We've talked about this at length at this point, but I think it's also useful to look at it in the context of the LHC

        [Taylor Childers] 15:41:48
        runs, right? By the time the High-Luminosity LHC turns on, we're gonna be dealing with machines.

        [Taylor Childers] 15:41:54
        We don't even know what they look like yet, and a lot can happen between now and then that can affect how those machines look.

        [Taylor Childers] 15:42:05
        So we now have Frontier deployed, so the US

        [Taylor Childers] 15:42:11
        has its first exascale machine. We'll have Aurora coming online by the end of the year. And the next-generation machines, you know, like I said, we don't know what those are. Everything that we have is

        [Taylor Childers] 15:42:26
        sort of Intel, Nvidia, AMD. I would expect these to follow similar trends, largely because of the politics of it

        [Taylor Childers] 15:42:37
        all, right? I mean, we're spending US taxpayer money, and they want that to go to US corporations.

        [Taylor Childers] 15:42:44
        So I expect those will stay static. But of course, the variations in combinations, as you can already see, are quite large, so those can still change.

        [Taylor Childers] 15:43:00
        Just to quickly put that in perspective: I included the recent Japanese machine that they deployed, and the European machines that have been announced. I'm pretty sure this was confirmed in Andrea's slides, or the slides on Euro-

        [Enrico Fermi Institute] 15:43:20
        Yeah.

        [Taylor Childers] 15:43:25
        HPC: that there's gonna be one more exascale

        [Taylor Childers] 15:43:29
        machine announced. So we know Jupiter is coming, and the plan was always to have two EuroHPC exascale machines before '25.

        [Taylor Childers] 15:43:42
        I include China on here; in principle they already have three exascale machines, and intend to have 10 by 2024-25.

        [Taylor Childers] 15:43:52
        That's their goal. There's no reason they can't do that.

        [Taylor Childers] 15:43:55
        They seem to be willing to burn as much coal as possible to keep these machines at the exascale.

        [Taylor Childers] 15:44:02
        As I understand it, this one is just a giant...

        [Taylor Childers] 15:44:05
        oh, no, the Tianhe-3 is a giant upgrade of the Tianhe-2.

        [Taylor Childers] 15:44:09
        So it's just a bunch of CPUs, and there is no energy budget there.

        [Taylor Childers] 15:44:13
        So it's, you know, a hot machine. The interesting thing about all of these is that they have various architectures that are very different.

        [Taylor Childers] 15:44:28
        Europe has gone heavy into ARM, and eventually will go into RISC-

        [Taylor Childers] 15:44:33
        V as an open-source accelerator format.

        [Taylor Childers] 15:44:37
        They're also, you know, into the sovereign

        [Taylor Childers] 15:44:42
        technologies; everybody wants, you know, their stuff built at home.

        [Taylor Childers] 15:44:47
        So the Japanese are using Fujitsu chips.

        [Taylor Childers] 15:44:51
        The Europeans are trying to design their own. I wouldn't be surprised if the ARM and the RISC-V stuff changes in the EU, because, you know, Intel has already announced they're gonna open some foundries in Europe, and I think that's kind of helped their image in the

        [Taylor Childers] 15:45:11
        area, so we'll see

        [Taylor Childers] 15:45:16
        So, just a quick look at the distribution of architectures.

        [Taylor Childers] 15:45:22
        I took the Top500, and I made the cutoff that a machine had to be bigger than 10 petaflops.

        [Taylor Childers] 15:45:28
        That leaves me at about 50 machines, and I just plot the flops by architecture. Frontier really heavily dominates this now, so you can see, you know, the AMD CPUs and GPUs from one exascale machine compared to everyone else.
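
        [Editor's note] The selection just described can be roughly reproduced from a published Top500 list. A sketch under assumed column names (the real list's columns differ):

        ```python
        # Illustrative only: filter a Top500-style CSV to machines above 10 PFlop/s
        # and sum Rmax by architecture. The column names are assumptions.
        import csv
        from collections import defaultdict

        CUTOFF_PFLOPS = 10.0

        def flops_by_architecture(csv_path):
            totals = defaultdict(float)
            with open(csv_path, newline="") as f:
                for row in csv.DictReader(f):
                    rmax = float(row["rmax_pflops"])  # assumed column name
                    if rmax < CUTOFF_PFLOPS:
                        continue
                    # Bucket by accelerator if present, otherwise by CPU family.
                    arch = row.get("accelerator") or row["cpu_arch"]  # assumed
                    totals[arch] += rmax
            return dict(totals)

        # With roughly 50 machines surviving the cut, a single exascale system
        # such as Frontier dominates whichever slice it falls into.
        ```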

        [Taylor Childers] 15:45:47
        So you can see, right now, outside of Frontier, Nvidia is really dominating the accelerators, and there's a nice distribution of CPUs. And then I went ahead

        [Taylor Childers] 15:46:03
        to '26 and tried to do the same plot

        [Enrico Fermi Institute] 15:46:05
        Okay.

        [Taylor Childers] 15:46:10
        for what I think is coming. So by 2026 the US

        [Taylor Childers] 15:46:16
        and Europe will both have two exascale machines; like I said, China will have up to 10.

        [Taylor Childers] 15:46:20
        I didn't include the Chinese machines in this number, largely because, I mean, I have no idea of the technicalities of what they're going to be running.

        [Taylor Childers] 15:46:32
        Europe has at least put out a roadmap, so their goal is to be using these ARM CPUs and the RISC-V accelerators.

        [Taylor Childers] 15:46:41
        So if I include those at, sort of, you know, over an exaflop,

        [Taylor Childers] 15:46:48
        then you start seeing this distribution. So you see there's ARM, AMD, Intel on the CPU side, and AMD,

        [Taylor Childers] 15:47:00
        Intel, and then this is essentially that RISC-V processor.

        [Taylor Childers] 15:47:04
        So if the Europeans decide to move to Nvidia or Intel or AMD,

        [Taylor Childers] 15:47:11
        this green blob here will shift. So you can see the variation is, you know, nearly equal.

        [Taylor Childers] 15:47:24
        So then there's specialty hardware. The DOE has always been strong in partnering with industry;

        [Taylor Childers] 15:47:33
        we really like pushing collaborations with industry. ALCF

        [Taylor Childers] 15:47:40
        hosts the DOE AI testbed, and currently we have five machines that are all custom silicon, designed for running large machine learning jobs. And so we've been working with those developers, testing out their software and whatnot. There's definitely an interest in identifying one or

        [Enrico Fermi Institute] 15:47:44
        Okay.

        [Taylor Childers] 15:48:04
        two that, you know, scientists like best, and then moving along with maybe making those sidecars to some future supercomputer, right?

        [Taylor Childers] 15:48:18
        So you could imagine having, you know, a couple of racks of these specialized chips available to you, to run your AI much, much faster than on a traditional GPU or CPU. The other thing I wanted to say, moving forward... sorry, I'm at home and my kids are coming home from

        [Taylor Childers] 15:48:43
        school. The other thing I wanted to mention was, of course, AI for science, in the context of ECP. So, many of you will be familiar with ECP, the Exascale Computing Project. Yeah,

        [Enrico Fermi Institute] 15:49:01
        Cool.

        [Taylor Childers] 15:49:01
        it was a large funded project on the ASCR side that, you know... the last number I heard is that, in principle,

        [Taylor Childers] 15:49:13
        it funded about 1,000 FTEs across the complex, and it was all geared toward preparing for exascale machines.

        [Taylor Childers] 15:49:24
        Now, with the landing of our two exascale systems, this project's going to be ramping down, and there's a lot of work to figure out what's going to come next.

        [Taylor Childers] 15:49:39
        And it really looks like AI for science is the next big push. There have already

        [Taylor Childers] 15:49:46
        been two years now worth of workshops

        [Taylor Childers] 15:49:50
        on the ASCR side, where we are trying to lay out the groundwork for what such a project would look like, how it would be managed, and what its goals would be. So I expect that in the next, you know, five years this is gonna be sort of a dominating

        [Taylor Childers] 15:50:13
        force, just like ECP was. So, just something to be aware of.

        [Enrico Fermi Institute] 15:50:15
        Thank you.

        [Taylor Childers] 15:50:19
        I think that's going to have a big impact on

        [Taylor Childers] 15:50:24
        how our systems look, yeah, in this next round of deployments.

        [Taylor Childers] 15:50:31
        So, are there any... So, the takeaways, I would say: the future of architectures at HPC facilities is quite diverse,

        [Taylor Childers] 15:50:40
        and I expect it to remain so. There might be some custom hardware, but it will be very niche, is what I expect, for AI, and you'll just be picking up TensorFlow and PyTorch and running your software the way you would anywhere else.

        [Taylor Childers] 15:50:54
        I would say, on the software implications there: using portable frameworks will be a benefit. And of course, the more we can complain and voice our interest in standard support, say through the C++ standard,

        [Taylor Childers] 15:51:16
        to companies, I think, you know, that's a good thing. But until everyone supports something like std::

        [Taylor Childers] 15:51:23
        par out of the C++ standard, you know, using these third-party libraries like Kokkos and SYCL and Alpaka is probably gonna be the best way to go for the moment. Let's see... current exascale machines

        [Taylor Childers] 15:51:38
        were largely decided before AI became a real focus

        [Taylor Childers] 15:51:43
        in DOE science, and I expect that to be a bigger driver for the next round of systems that are coming. Again, of course, the energy budgets and the competitive nature of these machines will probably drive them in the direction of accelerators again, but things

        [Taylor Childers] 15:52:07
        shift quickly; it's hard to predict. So yeah, that's where I leave that.

        [Enrico Fermi Institute] 15:52:19
        Taylor, I had a quick question. I think it's on slide 3, where you made the pie charts.

        [Enrico Fermi Institute] 15:52:26
        Yeah. If you would try to make a single pie chart, right?

        [Enrico Fermi Institute] 15:52:33
        The problem with pie charts is you can't tell the relative sizes: how much larger are the GPU flops currently versus the CPU flops?

        [Enrico Fermi Institute] 15:52:42
        Is there a way to get it down to a single one?

        [Taylor Childers] 15:52:48
        Yeah, I mean, any system that has accelerators is dominated by them, right? The last time I calculated that it was probably Summit, and there it was, you know, on the level of 5 to 10 times the CPU flops,

        [Enrico Fermi Institute] 15:52:54
        Yeah.

        [Taylor Childers] 15:53:06
        and it got even more skewed when I did the calculation for Frontier and Aurora.

        [Taylor Childers] 15:53:13
        But it's been a long time since I looked at those

        [Enrico Fermi Institute] 15:53:17
        So I guess the point is, if it was drawn to scale, the GPU pie chart would be 10 times larger than the CPU one, or 5 to 10 times, not the same size, right?

        [Taylor Childers] 15:53:23
        That's right.

        [Taylor Childers] 15:53:29
        For sure, for sure.
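
        [Editor's note] To put numbers on the "drawn to scale" point: a toy calculation with hypothetical per-node peak figures, chosen to be consistent with the 5 to 10x quoted above; they are placeholders, not vendor specs.

        ```python
        # Toy example: share of a node's peak FP64 flops coming from GPUs vs CPUs.
        cpu_tflops_per_node = 10.0       # hypothetical: two many-core server CPUs
        gpu_tflops_per_node = 6 * 13.0   # hypothetical: six accelerators

        total = cpu_tflops_per_node + gpu_tflops_per_node
        print(f"GPU share of peak: {gpu_tflops_per_node / total:.0%}")            # ~89%
        print(f"GPU/CPU ratio: {gpu_tflops_per_node / cpu_tflops_per_node:.1f}x") # ~7.8x
        ```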

        [Enrico Fermi Institute] 15:53:32
        And in your diagram, what is "Other" for the GPUs here?

        [Taylor Childers] 15:53:36
        So

        [Enrico Fermi Institute] 15:53:38
        Is that the

        [Taylor Childers] 15:53:40
        Yeah. So that would be, in this case, that would be the Fujitsu.

        [Enrico Fermi Institute] 15:53:47
        Okay.

        [Taylor Childers] 15:53:50
        I can look back in my spreadsheet, too.

        [Enrico Fermi Institute] 15:54:01
        That probably also explains why Barb is a larger piece than Kelvin.

        [Taylor Childers] 15:54:06
        Oh, no, sorry. In this one, "Other" is the Tianhe-2, which is on the Top500; it's this one.

        [Enrico Fermi Institute] 15:54:13
        Okay.

        [Enrico Fermi Institute] 15:54:21
        Okay, if you told me that was a 386 chip, I'd also believe you.

        [Enrico Fermi Institute] 15:54:26
        So, okay. Taylor, on performance portability: does that mean that, if they decide on a system design, the LCF,

        [Enrico Fermi Institute] 15:54:39
        or whatever funds it, makes sure that it's supported by the performance

        [Enrico Fermi Institute] 15:54:44
        portability libraries?

        [Taylor Childers] 15:54:46
        Well, and I think that's the benefit of something like

        [Taylor Childers] 15:54:51
        Kokkos, which is really third-party support, right?

        [Taylor Childers] 15:54:55
        So Kokkos came out of the ECP project, and I imagine it will continue to be supported.

        [Taylor Childers] 15:55:06
        And since it's third party, they can just come in and write a new plugin for whatever, you know, new hardware comes along, and so as long as you use it, you gain the benefit from that. When we first were working with Intel and SYCL, I was

        [Taylor Childers] 15:55:31
        very skeptical of SYCL. I mean, in general,

        [Taylor Childers] 15:55:35
        I'm skeptical of, especially, telling scientists to invest their time in a solution that's being pushed by one of the manufacturers.

        [Taylor Childers] 15:55:47
        Right? I mean, CUDA is a mess. As, you know, someone who came up in the sciences writing code,

        [Taylor Childers] 15:55:55
        I would never wish on anyone to have to write code in CUDA, and so I approached SYCL with the same suspicion.

        [Taylor Childers] 15:56:07
        But, I mean, it's getting good performance, and it allows you to write your code once, and so far we've been able to run it on all three systems.

        [Taylor Childers] 15:56:17
        At least with MadGraph: we have a SYCL implementation, and it runs on the AMD, the Intel, and the Nvidia GPUs without any problem, and does very well. And Kokkos is the same. And like you said, the nice thing about those two is that you write

        [Enrico Fermi Institute] 15:56:32
        See.

        [Taylor Childers] 15:56:37
        your code once. But with CUDA... the CUDA implementation of MadGraph right now is riddled with precompiler ifdefs everywhere, because if you're not on a CUDA device you need to run the C++ version, and, you know, it just becomes really hard to

        [Taylor Childers] 15:56:56
        maintain for someone who's not a dedicated software developer.

        [Enrico Fermi Institute] 15:57:07
        We still have to cover the HPC cost. I would like to at least attempt to go through the slides, to see.

        [Enrico Fermi Institute] 15:57:14
        Okay, if it runs too long, eventually we might have to cut it off and move it to tomorrow or something.

        [Enrico Fermi Institute] 15:57:18
        Yeah, we could start a little earlier tomorrow; I don't know how people feel about that.

        [Enrico Fermi Institute] 15:57:24
        Yeah, thanks, Taylor, appreciate it. So let's try to go to the HPC

      • 14:10
        Cost 20m

        HPC Cost and discussions

        [Enrico Fermi Institute] 15:57:32
        cost, and then we're right up on the... yeah.

        [Enrico Fermi Institute] 15:57:35
        There was a question in the charge, as you remember:

        [Enrico Fermi Institute] 15:57:41
        the total cost of operating HPC resources at this time, and it especially included the outlook. And the thing is, the cost of operating... I mean, this is really about acquiring and operating, because nominally they're free. I mean,

        [Enrico Fermi Institute] 15:58:02
        eventually there's some indirect effect, because you get them from the same funding agencies.

        [Enrico Fermi Institute] 15:58:07
        that fund your hardware purchases, but that's indirect, and that's also outside the scope of this workshop.

        [Enrico Fermi Institute] 15:58:14
        So you basically have to prepare your proposals once per year; usually ACCESS allows supplementals.

        [Enrico Fermi Institute] 15:58:22
        There's work on multi-year proposals, and maybe that will mean that you still have to do a proposal each year.

        [Enrico Fermi Institute] 15:58:30
        But you don't have to do much work for it.

        [Enrico Fermi Institute] 15:58:31
        You just sign it off with your request; you already know what you're getting. But this is a work in progress. And then there's technical integration and commissioning work, and that's mostly one-time:

        [Enrico Fermi Institute] 15:58:43
        you integrate a facility once, you find a way to make it work, and then you just have to maintain what you came up with. And this needs to be redone every few years,

        [Enrico Fermi Institute] 15:58:56
        because these HPC machines have a limited lifetime;

        [Enrico Fermi Institute] 15:58:58
        basically, five years is around the maximum. Expect them to replace it with a different machine.

        [Enrico Fermi Institute] 15:59:03
        What we've experienced so far is that there are synergy effects

        [Enrico Fermi Institute] 15:59:07
        if you stay within the same facility, because usually they have similar restrictions, similar ways to do things. So, switching from one cluster to another in the same facility, when they do a replacement, you don't have to throw out everything and start from scratch; you just make adjustments to what

        [Enrico Fermi Institute] 15:59:27
        you did before. There's an open question on the LCF

        [Enrico Fermi Institute] 15:59:34
        integration, at least from the CMS side; I mean, you have your Harvester. For us, at least, the long-term operational overheads

        [Enrico Fermi Institute] 15:59:42
        there are a little harder to estimate. They're likely also larger there, because the provisioning integration looks like it's gonna be a bit more complex and not tie neatly into what we're doing anyway

        [Enrico Fermi Institute] 15:59:57
        for the grid sites. So you need to do something special. Then, support:

        [Enrico Fermi Institute] 16:00:02
        I mean, that's one of the things that came up in the context of pledging.

        [Enrico Fermi Institute] 16:00:07
        you need somewhere to be able to send a ticket.

        [Enrico Fermi Institute] 16:00:11
        So there's operations support, because you nonetheless need a CMS site contact.

        [Enrico Fermi Institute] 16:00:16
        Now, admittedly, at the grid sites, the US T2s,

        [Enrico Fermi Institute] 16:00:19
        the site contact is also someone the operations program usually pays for.

        [Steven Timm] 16:00:23
        hmm.

        [Enrico Fermi Institute] 16:00:24
        So it's not that this is necessarily a cost that's unique to the HPCs.

        [Steven Timm] 16:00:29
        well, that

        [Steven Timm] 16:00:30
        Well, I mean, if there's a problem at NERSC now, we have an on-call team; there's a GGUS ticket, and we respond to it.

        [Enrico Fermi Institute] 16:00:31
        Yes.

        [Enrico Fermi Institute] 16:00:38
        Yes, exactly. That's what I mean. I mean, the T-

        [Steven Timm] 16:00:40
        So HEPCloud here is the same contact.

        [Enrico Fermi Institute] 16:00:42
        2s: if there's a problem at Wisconsin, you file a ticket, and the person that we pay money to, or funds to, from the operations program

        [Steven Timm] 16:00:51
        Okay.

        [Enrico Fermi Institute] 16:00:53
        at Wisconsin responds to it. So in that sense it's not that different from supporting grid site operations. And again, the other good example is that the grid folks use experiment-specific ops

        [Steven Timm] 16:00:55
        Good.

        [Enrico Fermi Institute] 16:01:09
        teams, or even WLCG-specific ops teams, which can be fairly far separated from the

        [Steven Timm] 16:01:16
        Yes.

        [Enrico Fermi Institute] 16:01:19
        people who are actually operating the cluster.

        [Steven Timm] 16:01:20
        Yeah.

        [Enrico Fermi Institute] 16:01:21
        Yeah. And then I want to break that operations support into two components,

        [Enrico Fermi Institute] 16:01:27
        because one is just normal workflow support, just dealing with: oh, you have a lot of failures,

        [Enrico Fermi Institute] 16:01:33
        can you look into it? And you look at log files or whatever; usually it's debugging of job failures. To first order this scales with the amount of resources, because the more work you pass through, the more problems you can expect. And there's overlap here with the normal

        [Enrico Fermi Institute] 16:01:50
        operations support by the experiment, the first line of defense that basically monitors overall workflow and computing operations,

        [Enrico Fermi Institute] 16:01:59
        up to the point where you open the GGUS ticket against the site. And then the second component is:

        [Enrico Fermi Institute] 16:02:07
        once said GGUS ticket is open, then on the site side,

        [Enrico Fermi Institute] 16:02:09
        whoever responds will have to have specialized HPC integration knowledge, because some of these failure modes can be specific to how that HPC

        [Enrico Fermi Institute] 16:02:20
        was integrated. And that implies that there's a long-term need to keep commissioning expertise around.

        [Enrico Fermi Institute] 16:02:28
        But we probably need to do that anyway, because of the HPC

        [Enrico Fermi Institute] 16:02:35
        cluster turnover: the commissioning efforts need to be redone.

        [Enrico Fermi Institute] 16:02:40
        So, if you're talking many HPCs, there's constantly a need to work on this stuff. We've been doing this long enough;

        [Enrico Fermi Institute] 16:02:48
        Can't you estimate what those labor costs are?

        [Enrico Fermi Institute] 16:02:52
        In FTEs? Yeah, you can try to come up with it.

        [Steven Timm] 16:02:54
        Right.

        [Enrico Fermi Institute] 16:02:55
        I mean, we've done it for multiple years; I can, for the user facilities.

        [Steven Timm] 16:02:57
        Oh!

        [Enrico Fermi Institute] 16:03:00
        You definitely can do it. The LCFs, as I said, I'm unsure, because I don't know what the long-term stable operations

        [Enrico Fermi Institute] 16:03:08
        mode will look like; at the moment that still needs to be worked out.

        [Enrico Fermi Institute] 16:03:11
        But for the user facilities, definitely, we can come up with an estimate. And then for the LCFs,

        [Steven Timm] 16:03:14
        Right. I mean

        [Enrico Fermi Institute] 16:03:17
        Can you write down? Why, you can't get what you need from that, so that the document you can make an estimate.

        [Enrico Fermi Institute] 16:03:25
        But you can qualify it. No, no; What I mean is, you can do it in the user facility right?

        [Steven Timm] 16:03:27
        Right.

        [Enrico Fermi Institute] 16:03:30
        And then because they have these these properties in the Lcs. You can't.

        [Steven Timm] 16:03:34
        Right.

        [Enrico Fermi Institute] 16:03:35
        You can put some error. Bars, but they're missing these properties.

        [Enrico Fermi Institute] 16:03:39
        They had those properties that the user facility had. Would that allow you to give a more perspective estimate for the Lcs.

        [Enrico Fermi Institute] 16:03:45
        You see what I'm saying Obviously, something about the way the user facilities are set up.

        [Steven Timm] 16:03:45
        Okay, Well.

        [Enrico Fermi Institute] 16:03:51
        The Steve on Steve, Steve.

        [Steven Timm] 16:03:52
        Yes. Hey, you have two components there. So what this means is:

        [Steven Timm] 16:03:59
        one of them is when the remote site changes their API,

        [Steven Timm] 16:04:03
        the way you have to log in. Okay, that's been done four times in six years now,

        [Steven Timm] 16:04:07
        breaking the interface that we used, and having to change it.

        [Steven Timm] 16:04:13
        So that's one end of things. I mean, this is fairly straightforward;

        [Steven Timm] 16:04:19
        at this point you should expect that it will change. The other part of it is stuff upstream of us; for instance, I'm talking about the OSG organization.

        [Steven Timm] 16:04:31
        I mean, there we still haven't quite got it done; all the various hacks that are done to get into the HPC

        [Steven Timm] 16:04:40
        sites don't necessarily translate as well as at a regular site; more work needs to be done

        [Steven Timm] 16:04:43
        there. So if you have a big change in the upstream OSG, or things like that, that can really throw us for a loop.

        [Enrico Fermi Institute] 16:04:53
        That's what I meant by technical integration and commissioning work:

        [Enrico Fermi Institute] 16:04:56
        that there's a long-term maintenance effort.

        [Steven Timm] 16:04:56
        Alright.

        [Steven Timm] 16:04:59
        Well, it

        [Enrico Fermi Institute] 16:04:59
        Everything there is a bit special, so there's always the chance that something will break, and you have to redo it.

        [Steven Timm] 16:05:05
        Right. You need somebody that can read and understand factory logs, basically,

        [Steven Timm] 16:05:08
        and be on call. Got it.

        [Enrico Fermi Institute] 16:05:11
        And the maintenance isn't necessarily evenly distributed

        [Enrico Fermi Institute] 16:05:15
        in time, right? Sometimes for six months nothing happens, and then something goes boom.

        [Steven Timm] 16:05:17
        Right, right. Then you have to allow for the fact that some of these people don't answer their tickets very well at all.

        [Steven Timm] 16:05:28
        Yeah, in particular... so if anybody's got a way to get people to listen to them,

        [Steven Timm] 16:05:38
        We'd like to hear it, because we have very little luck

        [Enrico Fermi Institute] 16:05:44
        And

        [Steven Timm] 16:05:45
        okay.

        [Enrico Fermi Institute] 16:05:47
        Okay. But I think we can make an attempt here to estimate this in terms

        [Steven Timm] 16:05:52
        Yeah, yeah, yeah, sure.

        [Enrico Fermi Institute] 16:05:52
        of FTEs. We can probably base it on existing experience, it has to be said. We have, for the grid, the T2 sites, which are also an index

        [Steven Timm] 16:05:59
        Well.

        [Enrico Fermi Institute] 16:06:02
        So, for sizing, too.

        [Steven Timm] 16:06:02
        So the amount of effort there to keep up with the maintenance is well known.

        [Enrico Fermi Institute] 16:06:07
        Yeah, but I also

        [Steven Timm] 16:06:09
        And so, basically, 30% of me; basically, that's what it is.

        [Steven Timm] 16:06:14
        So

        [Enrico Fermi Institute] 16:06:15
        But all FTEs are not created equal, so somehow you have to capture the skill set that FTE

        [Steven Timm] 16:06:18
        Good.

        [Enrico Fermi Institute] 16:06:22
        has. Yeah, that's harder to do in terms of a high-level document. I know it's harder, but you have to.

        [Enrico Fermi Institute] 16:06:35
        But, well, yeah, ATLAS and CMS have solved the same problem in

        [Enrico Fermi Institute] 16:06:40
        two slightly different ways, and that requires two different skill sets, political and technical.

        [Enrico Fermi Institute] 16:06:47
        The one that I really think we should hammer on is the difference in these costs

        [Enrico Fermi Institute] 16:06:54
        for an LCF-type facility versus a user facility. So I think you could probably communicate that more effectively.

        [Enrico Fermi Institute] 16:07:03
        That's probably... that might be the harder one. Sure.

        [Steven Timm] 16:07:04
        Oh, I mean, there's ongoing dev work, and there's gonna be ongoing dev work on the LCF side, too.

        [Steven Timm] 16:07:11
        I mean, significant dev work there.

        [Enrico Fermi Institute] 16:07:12
        Yeah, but that's a one-time cost.

        [Enrico Fermi Institute] 16:07:14
        We also want to try to estimate what the long-term operational support is, and there will be large error bars,

        [Enrico Fermi Institute] 16:07:22
        but we can make an attempt.

        [Steven Timm] 16:07:23
        Right.

        [Enrico Fermi Institute] 16:07:26
        And then, apart from the costs and efforts that are directly associated with HPC operations,

        [Enrico Fermi Institute] 16:07:36
        there's a secondary component that's a bit more indirect and harder to estimate, but it will come into play at some point as we scale up HPC operations: we need hardware and services at grid sites to support the data and job flows at the

        [Enrico Fermi Institute] 16:07:51
        HPCs.

        [Enrico Fermi Institute] 16:07:53
        Because you didn't put that on as a cost, but there's the payload cost.

        [Enrico Fermi Institute] 16:07:58
        So, in other words, as we just heard, in Europe and the US

        [Enrico Fermi Institute] 16:08:03
        the next generation of big machines will have more and more accelerators; that's where the flops are, you know.

        [Enrico Fermi Institute] 16:08:12
        Will we only have CPU-only work? Partly, because porting things to GPUs was specifically excluded as out of scope for this, I understand, but we have to explain that that is something that will probably have to be handled. Because, you know, obviously CMS, because GPUs are in your

        [Enrico Fermi Institute] 16:08:32
        trigger, you guys are a little bit farther ahead than ATLAS.

        [Enrico Fermi Institute] 16:08:36
        I mean, we will put that in as a component, but we're not going to put any effort level on it, because you can't, because you don't know. But it's not the goal for this document; it's not supposed to be its goal.

        [Enrico Fermi Institute] 16:08:49
        Another strategic thing you could talk about here is what's common versus

        [Enrico Fermi Institute] 16:08:59
        what's experiment-specific. Yeah, keeping it at the leading-order type things.

        [Enrico Fermi Institute] 16:09:08
        If we go through the presentations and find overlaps, then call those out, because again, when it comes to cost, you need to think about how the agencies view it.

        [Enrico Fermi Institute] 16:09:23
        They do like to see common activities.

        [Enrico Fermi Institute] 16:09:30
        You can't make things that are common not common.

        [Enrico Fermi Institute] 16:09:33
        But it would be death to say everything is the same, because, I think... sure.

        [Enrico Fermi Institute] 16:09:43
        But trying to call that out can be a strategic way to help people look at the cost.

        [Enrico Fermi Institute] 16:09:54
        Steve, I see your hand is still up. Did you have another comment?

        [Steven Timm] 16:10:00
        no, I was no.

        [Enrico Fermi Institute] 16:10:02
        Alright, on that last bullet... oh, no.

        [Enrico Fermi Institute] 16:10:10
        This is us.

        [Enrico Fermi Institute] 16:10:23
        When you get to the report writing... I mean, if I had a better way to state that... it doesn't have to be... I mean, what I would highlight is:

        [Enrico Fermi Institute] 16:10:33
        this doesn't have to be acquired at sites. For example, if you think of the Spin work at NERSC, that might be perfectly fine.

        [Enrico Fermi Institute] 16:10:43
        So, I mean, is it not really about edge services? No,

        [Enrico Fermi Institute] 16:10:51
        because, for instance, you wouldn't need Globus and all that if the WLCG data grid... if NERSC could be an equal member of the WLCG data grid, you would not have to do any sort of translation, jumping through hoops. If

        [Enrico Fermi Institute] 16:11:11
        ALCF had a gatekeeper or something equivalent that we could

        [Enrico Fermi Institute] 16:11:18
        both submit jobs to with tokens, that would be...

        [Enrico Fermi Institute] 16:11:22
        That's an example of an edge service that would be common development.

        [Enrico Fermi Institute] 16:11:25
        That would make the cost easier. But I'd include that more in the technical integration and long-term maintenance; that's stuff that has to happen at the HPC sites, I would include it there.

        [Enrico Fermi Institute] 16:11:41
        My problem with the last bullet is that it says having services at grid sites is the solution.

        [Enrico Fermi Institute] 16:11:51
        You could turn that bullet into additionally operated services for HPC,

        [Enrico Fermi Institute] 16:11:57
        as opposed to, say, services at grid sites. But that is a dollar cost;

        [Enrico Fermi Institute] 16:12:03
        that money was spent. Yeah, and it was to work around a deficiency.

        [Enrico Fermi Institute] 16:12:09
        But the point is, does that not fall under the prior two bullets?

        [Enrico Fermi Institute] 16:12:19
        What I thought to include here... we'll have a discussion on that later, because there are some integration hypotheticals and the impact on the rest of the collaboration.

        [Enrico Fermi Institute] 16:12:30
        It's more about, like: assume you have Fermilab, which is a big storage site for CMS in the US,

        [Enrico Fermi Institute] 16:12:36
        and consider the difference between putting 50,000 extra CPUs

        [Enrico Fermi Institute] 16:12:41
        at Fermilab, and having their 50,000 CPUs somewhere else:

        [Enrico Fermi Institute] 16:12:46
        it's the network and kind of external data serving and transport links.

        [Enrico Fermi Institute] 16:12:51
        Okay. So, especially in terms of capital equipment, I mean,

        [Enrico Fermi Institute] 16:12:56
        what we could do is say service operations, for the services support cost, and call that out separately from operations support.

        [Enrico Fermi Institute] 16:13:05
        But if you're really thinking hardware, call hardware out separately. That's a very different color of money.

        [Enrico Fermi Institute] 16:13:15
        That's hardware. The last bullet is hardware.

        [Enrico Fermi Institute] 16:13:18
        I can tell you how much we spend. Yeah. So, as I wrote it... yeah, in that case, don't mix it in with services;

        [Enrico Fermi Institute] 16:13:27
        have a hardware-only bullet, right?

        [Enrico Fermi Institute] 16:13:32
        And that hardware potentially needs renewal, right?

        [Enrico Fermi Institute] 16:13:36
        Of course, if we need it... what I mean is, if we continue to need it, we have to continue to fund it. So I would just split that last one into at least two bullets.

        [Enrico Fermi Institute] 16:13:47
        Yes, okay. Yes, I think that was the last item we had for today.

        [Enrico Fermi Institute] 16:13:53
        That is... are you thinking, at the end, for the strategic report,

        [Enrico Fermi Institute] 16:13:57
        in December or whatever, to have a dollar range here? Is that the goal, or just pointing out the considerations that need to be made? And...

        [Enrico Fermi Institute] 16:14:10
        We were specifically discouraged from comparing HPC

        [Enrico Fermi Institute] 16:14:16
        and cloud costs to grid costs, and there was a bit of back and forth, but at the end that's the decision that was made.

        [Enrico Fermi Institute] 16:14:24
        So we should just try to come up with some costs on their own,

        [Enrico Fermi Institute] 16:14:29
        without comparison. But, I mean, are you saying, for a user facility like NERSC,

        [Enrico Fermi Institute] 16:14:34
        We need between x

        [Enrico Fermi Institute] 16:14:40
        We'll put in an FTE number, different depending on where.

        [Enrico Fermi Institute] 16:14:51
        Costs. Can it also be folded in, should it also be folded in, that

        [Enrico Fermi Institute] 16:14:54
        X amount of CPU cores running efficiently means

        [Enrico Fermi Institute] 16:14:59
        Y amount of disk at the site? So that, if we can't get the Y

        [Enrico Fermi Institute] 16:15:05
        amount of disk through the grant procedure, then that would actually be a cost, because you would have to do the condo model of buying storage. Well, that's why, like, just have a separate hardware bullet. Where the hardware sits... I mean, obviously you care where the

        [Enrico Fermi Institute] 16:15:26
        hardware sits, but there will be a capital outlay.

        [Enrico Fermi Institute] 16:15:32
        This last part ties to the discussion this morning about data delivery, and having a significant cache or data endpoint at the HPCs,

        [Enrico Fermi Institute] 16:15:45
        if you wanted to do it that way. I don't mean to...

        [Enrico Fermi Institute] 16:15:49
        I guess the idea is that that would come through an allocation if it's part of the facility, right?

        [Enrico Fermi Institute] 16:15:53
        So maybe that is the assumption: if they give us the storage, then it comes from the... yeah.

        [Enrico Fermi Institute] 16:15:58
        But if we get very little storage, that puts a lot of pressure on the network, and then on storage somewhere else, because you have to be very...

        [Enrico Fermi Institute] 16:16:06
        You can think of it this way: I get 500 terabytes with my allocation, but I need a petabyte. How do I make up the needs gap? I either make it up through streaming in

        [Enrico Fermi Institute] 16:16:18
        and out, or I buy storage at the site, and so on.
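
        [Editor's note] The trade-off in this example can be made concrete with back-of-the-envelope arithmetic; every number below is an invented placeholder.

        ```python
        # Toy trade-off: cover a storage shortfall by buying condo storage vs. by
        # streaming data over the WAN. Every number is a made-up placeholder.
        allocated_tb = 500
        needed_tb = 1000
        gap_tb = needed_tb - allocated_tb            # 500 TB shortfall

        condo_cost_per_tb = 100.0                    # hypothetical $/TB condo price
        reuse_factor = 10                            # hypothetical: inputs read 10x

        condo_cost = gap_tb * condo_cost_per_tb
        streamed_tb = gap_tb * reuse_factor          # WAN traffic if nothing is kept
        print(f"Condo storage: ${condo_cost:,.0f}")
        print(f"Streaming instead moves ~{streamed_tb} TB over the WAN")
        ```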

        [Enrico Fermi Institute] 16:16:28
        It depends on how much time you have to fill it out. You can talk about the different types of costs and different example scenarios. Because the problem with

        [Enrico Fermi Institute] 16:16:36
        So these things about caches, or storage at the site — it's a trade-off. You can say,

        [Enrico Fermi Institute] 16:16:42
        well, if I put 200 TB at the site, I might save on the streaming,

        [Enrico Fermi Institute] 16:16:48
        but then obviously for some sites — I can find a quote for what it takes to put that storage on, say, Expanse as an example — that's usually a big storage purchase.

        [Enrico Fermi Institute] 16:17:04
        Well, that's the problem: what is "usually"? I can tell you what I'm doing.

        [Enrico Fermi Institute] 16:17:10
        I can tell you that NERSC allows you to buy in: give them the money and they do it. And some of the smaller sites —

        [Enrico Fermi Institute] 16:17:17
        that's in fact how the ATLAS group got into the LCRC.

        [Enrico Fermi Institute] 16:17:22
        They have a condo model: you buy the hardware, they'll deploy it.

        [Enrico Fermi Institute] 16:17:27
        That's the thing, though, because storage is like a multi-

        [Enrico Fermi Institute] 16:17:31
        year commitment. Or do you pay for it — do you rent it?

        [Enrico Fermi Institute] 16:17:35
        It basically depends. It's usually for a quantum of time, which may be multi-year, but at the end of the quantum it's bye-bye. Maybe write up a couple of scenarios to lay out the fact that some of these are trade-offs, and to communicate that.

        [Enrico Fermi Institute] 16:17:57
        But we'd prefer that it comes through the allocation process, because then in the application we lay out a use case and we say: we can use this much CPU, but then we need that much storage to actually use it effectively.

        [Enrico Fermi Institute] 16:18:10
        So this would be a

        [Enrico Fermi Institute] 16:18:13
        It would not be a preferred choice that we have to buy storage. It gets into how much time you want to spend drawing up scenarios.

        [Enrico Fermi Institute] 16:18:21
        There's a lot to write here. The HPC facilities typically haven't had in their architecture something sitting there that looks like a cache facing the wide-area network.

        [Enrico Fermi Institute] 16:18:33
        In other words, they have different ways of provisioning storage internally.

        [Enrico Fermi Institute] 16:18:40
        But usually, like we saw from NERSC: there's a big scratch disk, and there's other storage — the home file system, the big scratch area — but there didn't seem to be something sitting on the edge of

        [Enrico Fermi Institute] 16:18:53
        the network that could actually serve as a cache.

        [Enrico Fermi Institute] 16:18:59
        I mean, the file systems are connected to data transfer nodes to the outside, and that's a separate connection.

        [Enrico Fermi Institute] 16:19:05
        It's not internal, but it's usually high speed, so you can get in and out of there.

        [Enrico Fermi Institute] 16:19:11
        It's not visible on the inside, though. What's your budget?

        [Enrico Fermi Institute] 16:19:17
        I think it's what Doug was saying

        [Enrico Fermi Institute] 16:19:21
        Buy five more switches, and remember the cache — so we'll say yes.

        [Enrico Fermi Institute] 16:19:30
        Okay, any other comments from the Zoom?

        [Enrico Fermi Institute] 16:19:38
        I think we're done. Thanks, everybody for slogging it out.

        [Enrico Fermi Institute] 16:19:43
        Yeah. So I think that's good, and we'll come back to HPC in some of the later discussions.

        [Enrico Fermi Institute] 16:19:50
        The focus tomorrow morning — yes, we start with the cloud focus area tomorrow, and then in the afternoon we'll have networks, integration, hypotheticals, and R&D.

        [Enrico Fermi Institute] 16:20:04
        Okay, good. Thanks, everybody. We'll talk to you tomorrow.

        [Antonio Perez-Calero Yzquierdo] 16:20:09
        Thank you.

      • 14:30
        Discussion 30m
    • 10:00 12:00
      Second Day Morning: Cloud Focus Area

      Morning session [Eastern Time]

       

      [Kenyi Paolo Hurtado Anampa] 11:05:44
      Okay, so good morning, everyone. Today we will cover

      [Kenyi Paolo Hurtado Anampa] 11:05:51
      two different things. One is resources — this is going to be all of the morning session — and then in the afternoon we are going to talk mostly about networking and the system integration of HPC and clouds, and then R&D. So for the cloud focus area, we

      [Kenyi Paolo Hurtado Anampa] 11:06:12
      will start with just summarizing, at a very high level,

      [Kenyi Paolo Hurtado Anampa] 11:06:17
      what ATLAS and CMS have done. In the case of ATLAS —

      [Kenyi Paolo Hurtado Anampa] 11:06:24
      well, they've got a self-contained grid site, linked to their existing infrastructure, and they have their own Squid and CVMFS.

      [Kenyi Paolo Hurtado Anampa] 11:06:42
      This will be discussed in more detail in the next slides. Okay?

      [Kenyi Paolo Hurtado Anampa] 11:06:47
      And then for CMS: this is basically describing what was done about five or six years ago,

      [Kenyi Paolo Hurtado Anampa] 11:06:55
      during the demo testing that CMS did with production workflows. The way this was done was by extending an existing CMS site — more particularly, the Fermilab resources — with resources in the cloud, and this was done via

      [Kenyi Paolo Hurtado Anampa] 11:07:14
      HEPCloud. Again, this will be described in more detail in the next few slides. Since this was done this way, in terms of production integration we have the same reservations as HPCs in terms of storage for workflows, which means that all data must be staged

      [Kenyi Paolo Hurtado Anampa] 11:07:36
      to existing sites.

      [Kenyi Paolo Hurtado Anampa] 11:07:42
      Go on with the next slide — this is for ATLAS. Fernando?

      [Fernando Harald Barreiro Megino] 11:07:48
      Yeah, sure. So this is the overview of what we are working on in ATLAS.

      [Fernando Harald Barreiro Megino] 11:07:54
      So we have two main projects. The one on the left is on Amazon, and this comes through Fresno, from California State University.

      [Fernando Harald Barreiro Megino] 11:08:04
      And here we have basically a PanDA queue, a storage element, and also a Squid — those are always the three main cost components that we will see later, along with

      [Fernando Harald Barreiro Megino] 11:08:13
      the egress. And then the second project is the one that we have in

      [Fernando Harald Barreiro Megino] 11:08:23
      Google. It used to be US-ATLAS-centric, but this year, since the middle of July, it became a worldwide ATLAS project.

      [Fernando Harald Barreiro Megino] 11:08:35
      And so ATLAS, as a collaboration, is participating in the budget.

      [Fernando Harald Barreiro Megino] 11:08:42
      And in this project we have a similar setup as in Amazon, with a PanDA queue, Rucio storage element, and the Squid.

      [Fernando Harald Barreiro Megino] 11:08:50
      But we also work on an analysis facility prototype, with Jupyter and Dask.

      [Fernando Harald Barreiro Megino] 11:08:59
      So the integration of these cloud resources was done by the Rucio team and the PanDA team.

      [Fernando Harald Barreiro Megino] 11:09:07
      So we take a different approach than if you were trying to extend a grid site: we just generate a self-contained, cloud-native site. In the case of Rucio and the storage, it works in the way that you download the key from Amazon or from

      [Fernando Harald Barreiro Megino] 11:09:28
      Google, and with that key you can sign URLs. With a signed URL you can say: you can upload this particular file until an hour from now, or you can download or delete it. Then this key needs to be put into Rucio and into FTS so that they can generate

      [Fernando Harald Barreiro Megino] 11:09:47
      the signed URLs for the downloads or the third-party transfers. For the compute part, it's all based on Kubernetes and native integration.
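
      As an illustration of the signed-URL mechanism described here, a minimal sketch using the google-cloud-storage Python client; the bucket name and key file are placeholders, not the real ATLAS setup:

          # Sketch of the signed-URL flow: a downloaded service-account key is
          # used to mint time-limited upload/download URLs, which is what gets
          # handed to Rucio/FTS for transfers. Names are placeholders.
          from datetime import timedelta
          from google.cloud import storage

          client = storage.Client.from_service_account_json("downloaded-key.json")
          blob = client.bucket("example-atlas-bucket").blob("data/file.root")

          # V4 signed URLs valid for one hour; the holder can PUT or GET this
          # object without any other credentials.
          upload_url = blob.generate_signed_url(
              version="v4", expiration=timedelta(hours=1), method="PUT"
          )
          download_url = blob.generate_signed_url(
              version="v4", expiration=timedelta(hours=1), method="GET"
          )
          print(upload_url, download_url, sep="\n")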

      [Fernando Harald Barreiro Megino] 11:10:00
      In particular, there is nothing like a Condor in the setup. And then we have CVMFS installed on the nodes of our Kubernetes cluster — that was one of the things that actually took most of the effort to get to a very reliable

      [Fernando Harald Barreiro Megino] 11:10:18
      state. And then also the Squid part — you can either run it as part of the Kubernetes cluster or, in Google for example, just run a load-balanced instance group. And the other thing that I always use for the compute is the auto-

      [Fernando Harald Barreiro Megino] 11:10:41
      scaling. So when there are no jobs queued for, for example, the PanDA compute part,

      [Fernando Harald Barreiro Megino] 11:10:48
      it shrinks to a minimum, and then, if you submit a lot of jobs, the cluster grows up to the limit, or as much as needed for hosting all of the jobs. Yeah, and the setup is not bound to any particular cloud provider;
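
      A minimal sketch of the scale-with-the-queue idea, using the official kubernetes Python client; the get_queued_jobs() function is a hypothetical stand-in for a PanDA/Harvester backlog query, and the deployment names are placeholders:

          # Scale a compute deployment to match the queued-job count, in the
          # spirit of the autoscaling described above.
          from kubernetes import client, config

          MIN_PODS, MAX_PODS, JOBS_PER_POD = 1, 1000, 1

          def get_queued_jobs() -> int:
              return 250  # placeholder: ask the workload manager for its backlog

          def rescale(namespace="panda", deployment="pilot-pool"):
              config.load_kube_config()
              apps = client.AppsV1Api()
              want = max(MIN_PODS, min(MAX_PODS, get_queued_jobs() // JOBS_PER_POD))
              # Patch only the replica count; the cluster autoscaler then adds
              # or removes nodes to fit the requested pods.
              apps.patch_namespaced_deployment_scale(
                  deployment, namespace, {"spec": {"replicas": want}}
              )

          rescale()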

      [Fernando Harald Barreiro Megino] 11:11:07
      it's just standard protocols and technology.

      [Fernando Harald Barreiro Megino] 11:11:10
      So you can in principle use the same setup with other cloud

      [Fernando Harald Barreiro Megino] 11:11:13
      providers. For example, I tried out the PanDA part one time in Oracle Cloud, just to see that it works.

      [Fernando Harald Barreiro Megino] 11:11:22
      Yeah, then in the next slide, please.

      [Fernando Harald Barreiro Megino] 11:11:27
      So one of the things that you can exploit on all of these commercial clouds is all the different types of architectures that they have and that you don't always have on grid sites.

      [Fernando Harald Barreiro Megino] 11:11:41
      One particular example is on Amazon, where we were doing some ARM testing.

      [Fernando Harald Barreiro Megino] 11:11:47
      In this case it was Johannes and the team who were trying to build the Athena simulation software for arm64.

      [Fernando Harald Barreiro Megino] 11:11:58
      They had done the build, and they wanted to do a small physics

      [Fernando Harald Barreiro Megino] 11:12:01
      validation, running a whole task with that.

      [Fernando Harald Barreiro Megino] 11:12:04
      But there was not really any available grid site with ARM resources that could set that up.

      [Fernando Harald Barreiro Megino] 11:12:12
      So what we did is set it up in Amazon with the Graviton 2 nodes, as in the right-side diagrams.

      [Fernando Harald Barreiro Megino] 11:12:25
      The first validation that Johannes did was with 10,000 events,

      [Fernando Harald Barreiro Megino] 11:12:29
      and he compared the x86 run, which had been executed at TRIUMF,

      [Fernando Harald Barreiro Megino] 11:12:32
      I believe, against the arm64 run on Amazon, and

      [Fernando Harald Barreiro Megino] 11:12:36
      it was matching quite well. Then, some weeks later, we prepared the full physics validation with 1 million events, and that was fully signed off a few weeks ago.

      [Fernando Harald Barreiro Megino] 11:12:49
      So in principle, simulation could be executed

      [Fernando Harald Barreiro Megino] 11:12:57
      on ARM like in standard production now. And I mean, we don't do this in particular for the cloud; we do it more because, like it was discussed yesterday in the HPC

      [Fernando Harald Barreiro Megino] 11:13:06
      session, most of the next-generation HPCs are going to come with more ARM CPUs, and x

      [Fernando Harald Barreiro Megino] 11:13:17
      86 is not going to be as dominant, so it's a preparation for that.

      [Fernando Harald Barreiro Megino] 11:13:20
      Other things — other exotic architectures or resources that can be used

      [Fernando Harald Barreiro Megino] 11:13:27
      in the cloud: for example, there is a user doing some trigger studies for a filter, and there he's using FPGAs on Amazon. Or Johannes, for building the software, uses very large nodes on Amazon and

      [Fernando Harald Barreiro Megino] 11:13:47
      Google, and also GPU stuff. Next slide, please. And if anyone has a question or comment while I'm going through the slides, you can interrupt me.

      [Fernando Harald Barreiro Megino] 11:14:01
      Now we come to Google. Just running Google as a grid site — you can see two different approaches. On the top right plot,

      [Fernando Harald Barreiro Megino] 11:14:14
      you can see how we were doing scale tests.

      [Fernando Harald Barreiro Megino] 11:14:17
      That was done during the previous funding round, when we were trying to see how far we can scale it in a single cloud region, and we were getting to 100,000 cores in europe-west1,

      [Fernando Harald Barreiro Megino] 11:14:35
      which is one of the European regions. And if you would want to scale this out even more, you could replicate the setup to the US, to multiple regions in Europe, and so on, reaching a very high number of cores. What we are doing now,

      [Fernando Harald Barreiro Megino] 11:14:56
      since it's a fully worldwide ATLAS project, is running a fixed-size grid site at the moment.

      [Fernando Harald Barreiro Megino] 11:15:05
      We started with 5,000 cores, and we moved it to 10,000 cores

      [Fernando Harald Barreiro Megino] 11:15:10
      exactly a month ago, and we can run any type of production.

      [Fernando Harald Barreiro Megino] 11:15:16
      We are not running analysis at the moment, because we need to reorganize the storage.

      [Fernando Harald Barreiro Megino] 11:15:23
      In particular, we need a separate DATADISK and SCRATCHDISK so that user outputs don't end up in the same storage element.

      [Fernando Harald Barreiro Megino] 11:15:32
      But otherwise, this grid site has worked very well.

      [Fernando Harald Barreiro Megino] 11:15:38
      It's very reliable, with a very low error rate.

      [Fernando Harald Barreiro Megino] 11:15:41
      And the errors are usually very focused on particular situations — for example, I migrated to machines with low disk, or at one time there were issues with some tasks and I had to fix that. And our goal is to do a mix of

      [Fernando Harald Barreiro Megino] 11:16:02
      both versions: mix the on-demand fast scale-out with a fixed size.

      [Fernando Harald Barreiro Megino] 11:16:12
      So we plan to run, more or less, a flat queue with 5,000 cores, and then on top run a dynamic queue which processes urgent requests.

      [Fernando Harald Barreiro Megino] 11:16:24
      Or we are going to do something that we call the full chain, where all of the steps in a simulation in our production chain run inside the same resource, and you only export the final output in order to reduce the egress cost.

      [Fernando Harald Barreiro Megino] 11:16:48
      Yeah, and the next slide — thanks, Kenyi. The other thing that we tried out is this analysis facility prototype.

      [Fernando Harald Barreiro Megino] 11:16:56
      What we wanted to do is Dask scaling evaluations.

      [Fernando Harald Barreiro Megino] 11:17:03
      So we installed Jupyter and Dask on Google.

      [Fernando Harald Barreiro Megino] 11:17:08
      We integrated it with the ATLAS IAM, so anyone from ATLAS can connect without needing to request any particular new account or anything.

      [Fernando Harald Barreiro Megino] 11:17:20
      And then we have a couple of different options that the user can select.

      [Fernando Harald Barreiro Megino] 11:17:27
      There's this first lightweight version, but then we also have machine-learning images,

      [Fernando Harald Barreiro Megino] 11:17:31
      so that people can use TensorFlow and all those libraries. And you can also, if you want, get a notebook with a GPU — that will take a little moment, because you need to provision the machine.

      [Fernando Harald Barreiro Megino] 11:17:50
      You need to install CVMFS, mount CVMFS, and then add it to the cluster.

      [Fernando Harald Barreiro Megino] 11:17:57
      That takes a couple of minutes, but then you have a notebook with a GPU just for yourself, and you can work as long as you need. And for the Dask part — which is, in my opinion, a very good example for cloud scalability — the lower right plot was

      [Fernando Harald Barreiro Megino] 11:18:20
      from a user who was trying out running the same task, but with a different number of workers.

      [Fernando Harald Barreiro Megino] 11:18:27
      So he ran first with 100 workers, and it took 40 min.

      [Fernando Harald Barreiro Megino] 11:18:30
      Then he reran the same task with 200 workers, and the duration was halved — and so on, until the last point, where he uses 1,500 workers and the task is done within just a few minutes. And the thing about this is that the cost on the cloud is

      [Fernando Harald Barreiro Megino] 11:18:51
      roughly the same, except for maybe the scaling or scheduling overhead.

      [Fernando Harald Barreiro Megino] 11:18:57
      The cost is roughly the same whether you run with very few workers or with a lot of workers, and for the user himself it makes a lot of difference whether he gets the results in 1 h or in 5 min. And yeah, we also should consider in the cost

      [Fernando Harald Barreiro Megino] 11:19:19
      calculation the salary of the user himself, since he's optimizing his time a lot.
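
      A sketch of the worker-scaling experiment described here, assuming a Dask Gateway in front of the cluster; the gateway address and the per-chunk analysis function are placeholders:

          # Same workload at different worker counts: cost/hour scales with the
          # number of workers, but wall-clock time shrinks roughly in proportion,
          # so the total spend (workers x hours) stays about the same.
          from dask_gateway import Gateway

          gateway = Gateway("http://gateway.example.org")  # placeholder address
          cluster = gateway.new_cluster()

          def analyze(chunk):
              return sum(chunk)  # stand-in for the real per-chunk analysis

          for workers in (100, 200, 1500):
              cluster.scale(workers)
              client = cluster.get_client()
              results = client.gather(client.map(analyze, [range(1000)] * 10_000))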

      [Fernando Harald Barreiro Megino] 11:19:27
      Yes, and that's it. Kenyi, the next slide.

      [Kenyi Paolo Hurtado Anampa] 11:19:32
      Thanks, Fernando. Yes — and so then for CMS again,

      [Kenyi Paolo Hurtado Anampa] 11:19:37
      this is what was done a few years ago, and again, as I mentioned before, we did this by integrating cloud resources into one of the tier sites, at the Fermilab site,

      [Kenyi Paolo Hurtado Anampa] 11:19:51
      via HEPCloud. So you basically have a workflow injected,

      [Kenyi Paolo Hurtado Anampa] 11:19:55
      which is the resource provisioning trigger.

      [Kenyi Paolo Hurtado Anampa] 11:19:58
      This enters the facility interface which talks to the authentication and authorization mechanisms.

      [Kenyi Paolo Hurtado Anampa] 11:20:04
      Then there is a decision engine and a facility pool

      [Kenyi Paolo Hurtado Anampa] 11:20:09
      there, and the decision engine basically talks to a provisioner that will be talking to

      [Kenyi Paolo Hurtado Anampa] 11:20:16
      the cloud. And so this is basically a diagram of the HEPCloud architecture.

      [Kenyi Paolo Hurtado Anampa] 11:20:22
      What you have from there basically goes to provisioning the resources in the cloud.

      [Kenyi Paolo Hurtado Anampa] 11:20:34
      So you have them connecting to the HTCondor schedulers in the glideinWMS infrastructure, and that's how everything

      [Kenyi Paolo Hurtado Anampa] 11:20:43
      is connected in this case.

      [Kenyi Paolo Hurtado Anampa] 11:20:53
      Okay. And the next part is Lancium — Dirk, I think you're talking.

      [Dirk] 11:21:01
      Yes. So Lancium was already mentioned yesterday. It's an interesting new

      [Dirk] 11:21:11
      company. They're not like your traditional full-service cloud provider that basically operates worldwide and gives you anything you want in terms of capabilities and instance types and whatever. They're really geared towards utilizing low-cost renewable energy

      [Dirk] 11:21:35
      to provide cheap compute, basically. Part of the business model is almost like an energy utility:

      [Dirk] 11:21:42
      basically, they get money for being able to load-shed. And they're —

      [Dirk] 11:21:47
      they're constructing the data centers right now in areas with very high renewable wind energy. And we did a test a few months back where we integrated them into production.

      [Dirk] 11:21:59
      We ran a few small workflows; it was all on free cycles, as a test.

      [Dirk] 11:22:05
      Basically, they're a bit different from AWS and Google: they only support Singularity containers, not VMs.

      [Dirk] 11:22:11
      And what we did is we just ran a pilot job in the Singularity container, and the pilot itself is just the standard CMS pilot.

      [Dirk] 11:22:21
      So it runs our payloads in a nested Singularity container.

      [Dirk] 11:22:26
      CVMFS and a local Squid were provided from Lancium;

      [Dirk] 11:22:30
      we worked with them on that. They currently don't have any local managed storage,

      [Dirk] 11:22:34
      just job scratch. So we basically run these resources like we do opportunistic OSG

      [Dirk] 11:22:39
      or HPC sites, where we don't use managed storage:

      [Dirk] 11:22:42
      we just use AAA reads to get the input and then stage out to Fermilab.

      [Dirk] 11:22:48
      So that covers the runtime.

      [Dirk] 11:22:50
      The provisioning integration is another potentially problematic area for the long term, because they have a custom API which is not compatible with AWS or Google. And they're running Singularity containers, so you need some way to start up

      [Dirk] 11:23:05
      a container. What we're doing right now is just vacuum provisioning:

      [Dirk] 11:23:10
      when we want to run a test, we just start up a container manually as needed.

      [Dirk] 11:23:15
      And that's relatively simple through the API, because the API is just —

      [Dirk] 11:23:19
      you can run some script that calls out to the Python API.

      [Dirk] 11:23:25
      It calls out to the API and tells you how many containers are running;

      [Dirk] 11:23:28
      if it's less than 10, you bring it up to 10. So that's basically the level of integration that we have right now.
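
      A sketch of the vacuum top-up loop Dirk describes; the two helper functions are hypothetical stand-ins for Lancium's custom Python API (only the if-below-target-then-start logic is the point here):

          import random, time

          TARGET = 10  # desired number of pilot containers

          def count_running_containers() -> int:
              # hypothetical: ask the provider's Python API how many pilots run now
              return random.randint(0, TARGET)

          def start_pilot_container() -> None:
              # hypothetical: launch one Singularity pilot container via the API
              print("starting pilot container")

          while True:
              # Top up to the target; do nothing if enough are already running.
              for _ in range(max(0, TARGET - count_running_containers())):
                  start_pilot_container()
              time.sleep(300)  # re-check every five minutes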

      [Dirk] 11:23:36
      So we think it's interesting enough. It will require a little bit of work to get it really working, to get it fully integrated,

      [Dirk] 11:23:44
      but we're working with Lancium on procuring a small number of cycles for more tests.

      [Dirk] 11:23:51
      The plan is maybe to get some cycles there and then see if we can —

      [Dirk] 11:23:57
      when there's particular load from CMS specifically on the Fermilab site, we can say: okay, we bring up Lancium resources, and that frees up resources at Fermilab to do stuff that is most suited to a Tier

      [Dirk] 11:24:10
      One

      [Enrico Fermi Institute] 11:24:28
      Just to get in there for a second: all of this is very much oriented around production jobs. Do you think we could organize, sometime in the next year or something like that, trying to interface this with either Coffea-Casa or the Elastic Analysis Facility

      [Enrico Fermi Institute] 11:24:45
      effort, to see if we can gain more flexibility for more bursty analysis jobs, much like what ATLAS was doing with Google Cloud and whatnot?

      [Dirk] 11:24:56
      We could try. I mean, the —

      [Enrico Fermi Institute] 11:24:58
      The security is going to be a nightmare at the Elastic Analysis

      [Dirk] 11:25:03
      Yeah, the thing is, it really depends how well everything plays together with the provisioning integration.

      [Enrico Fermi Institute] 11:25:10
      Facility, for sure.

      [Dirk] 11:25:10
      I mean, they have a simple Api. They just pass it.

      [Enrico Fermi Institute] 11:25:10
      Yeah.

      [Dirk] 11:25:13
      You basically need a token associated with your account, and then you have a single, monolithic Python script that they give you, where you can tell it to start a container and bring up something. So it's

      [Enrico Fermi Institute] 11:25:25
      Okay.

      [Dirk] 11:25:27
      relatively simple. So sure, I mean, we can look at it. It's a matter of: do you want to do it?

      [Enrico Fermi Institute] 11:25:32
      I mean

      [Dirk] 11:25:35
      Do you want to do tests, or you want to do it for real?

      [Dirk] 11:25:37
      Because if you do it for real, then you actually need to have paid for a number of cycles sitting there. For tests, we can just go whenever.

      [Enrico Fermi Institute] 11:25:46
      Yeah, I think we would need to get the facilities — or at least the one at Fermilab — set up as it is right now, and then go for a more real test with people's actual analysis

      [Enrico Fermi Institute] 11:25:57
      jobs once we have that set up. I think that would be the better way to see how this actually works.

      [Enrico Fermi Institute] 11:26:06
      So this is like a year timescale, or something like that.

      [Enrico Fermi Institute] 11:26:09
      Your analysis facility — do you have any implicit dependencies on shared file systems, or anything like that? Because we're at Fermilab, we're restricted from using shared file systems, aside from like XRootD and stuff. Okay, yeah, I was going to

      [Enrico Fermi Institute] 11:26:24
      say, that might be one challenge in stretching out, right: how do you get a shared file system out there?

      [Dirk] 11:26:29
      They.

      [Enrico Fermi Institute] 11:26:29
      Exactly — but thankfully, we've already been forced to solve that.

      [Dirk] 11:26:33
      Maybe, since Lindsey just mentioned a year — the time horizon on that. Currently Lancium,

      [Dirk] 11:26:41
      as I said, is a young company starting up. They're kind of still building the data centers. So they have a test data center

      [Enrico Fermi Institute] 11:26:45
      Hmm.

      [Dirk] 11:26:51
      That's in Houston, which is not really using renewable energy.

      [Dirk] 11:26:54
      but where they're basically just deploying the whole hardware/software integration that they're working with, and that's what we've been testing. What they're building right now, which is supposed to come online at some point later this year or early next year, is really the big data centers,

      [Dirk] 11:27:08
      which are co-located with wind-energy hotspots in Texas.

      [Dirk] 11:27:15
      There's not much else there, but they're building a data center, and those will be the interesting

      [Dirk] 11:27:20
      ones, basically, because that's real renewable energy.

      [Dirk] 11:27:22
      There's lots and lots of power capacity there. And, more importantly, they're going to connect them at 100 gigabit to ES-

      [Dirk] 11:27:33
      net and everything else.

      [Enrico Fermi Institute] 11:27:34
      Okay — they're actually going to peer? They're actually going to connect, peer with ESnet, for sure?

      [Dirk] 11:27:41
      That's what their plan is, because they are kind of pushing — they're basically making the sales pitch hard to academic users.

      [Dirk] 11:27:51
      I mean, I've seen talks from them at SC, at OSG —

      [Dirk] 11:27:56
      they basically travel around Europe, because for Europe, running compute on cheap power is an even bigger concern right now than in the US,

      [Dirk] 11:28:06
      because power prices there traditionally have been much higher.

      [Dirk] 11:28:09
      And now they're extremely much higher than in the US.

      [Enrico Fermi Institute] 11:28:12
      But are they going to connect to ESnet versus Internet2?

      [Dirk] 11:28:17
      Probably. I mean, they said they're really —

      [Dirk] 11:28:20
      they basically point out that that's another of their selling points.

      [Dirk] 11:28:24
      They point out that they want to not charge for egress.

      [Dirk] 11:28:32
      So yeah — not charging for egress, and good network

      [Dirk] 11:28:39
      integration for academic workloads, seems to be what they're focusing on. Because you have to look at it: they are low quality of service,

      [Dirk] 11:28:50
      somewhat by design. They're not like Amazon, ready to sell you a VM

      [Dirk] 11:28:55
      and promise you 99-point-whatever uptime. Lancium tells you:

      [Dirk] 11:28:59
      if there's no wind, we're going to load-shed like crazy,

      [Dirk] 11:29:04
      so we're going to evacuate you. And that's fine,

      [Dirk] 11:29:08
      but that also means that they have to have other selling points and other target markets, because they're not going to attract the financial sector or industry

      [Dirk] 11:29:18
      that wants a high-uptime compute service sitting somewhere.

      [Ian Fisk] 11:29:22
      But the key thing was: ESnet, not Internet2, right?

      [Dirk] 11:29:26
      I'm not exactly sure. They basically — we had —

      [Ian Fisk] 11:29:28
      Oh sure!

      [Dirk] 11:29:30
      and you have to remember, the discussions we had with them

      [Ian Fisk] 11:29:32
      Right.

      [Dirk] 11:29:34
      were months before — these data centers are still under construction, so I don't think the network is connected yet.

      [Ian Fisk] 11:29:38
      Right. The only reason I ask is that the ESnet charter is relatively strict, and it will allow you to connect Fermilab or BNL or CERN to Lancium resources, but, for instance, it won't carry traffic from a university. So one of the

      [Dirk] 11:29:39
      Nope.

      [Ian Fisk] 11:29:58
      endpoints needs to be under the ESnet charter, which can be a little limiting.

      [Dirk] 11:30:04
      I mean, at the moment I would read it as a statement of intent: they want to do everything they can on the network-integration side to make it easy for us to use their facilities. And then what's actually deployed and how things are connected — I think we have to wait for these data centers.

      [Ian Fisk] 11:30:08
      Okay.

      [Ian Fisk] 11:30:15
      Right.

      [Ian Fisk] 11:30:22
      Right. And it could be fine — that matches under the charter.

      [Kenyi Paolo Hurtado Anampa] 11:30:50
      Yes. So the next section in these slides is about cloud costs, and just like yesterday, the green

      [Kenyi Paolo Hurtado Anampa] 11:31:04
      here means this is one of the questions from the charge that we need to answer.

      [Kenyi Paolo Hurtado Anampa] 11:31:08
      And it's basically: what is the total cost of operating commercial cloud resources for collaboration workflows?

      [Kenyi Paolo Hurtado Anampa] 11:31:15
      This is mostly focused on production workflows for the charge, both for compute resources as well as the operational effort, for LHC

      [Kenyi Paolo Hurtado Anampa] 11:31:24
      Run 3. And so with that, we will start with: what's the experience so far on the cost for ATLAS and CMS? So, Fernando, do you want to take over from here?

      [Fernando Harald Barreiro Megino] 11:31:38
      Yes. So the content of this slide is mostly my opinion and experience,

      [Fernando Harald Barreiro Megino] 11:31:47
      and just to note that with this ATLAS Google project there will be a dedicated TCO —

      [Fernando Harald Barreiro Megino] 11:31:59
      total cost of ownership — board that will then study the costs in detail.

      [Fernando Harald Barreiro Megino] 11:32:08
      So, to explain a little bit the cost model that you have in the cloud:

      [Fernando Harald Barreiro Megino] 11:32:13
      for the compute there are different tiers of virtual machine.

      [Fernando Harald Barreiro Megino] 11:32:20
      You have the reserved instances, where you say basically: I will

      [Fernando Harald Barreiro Megino] 11:32:24
      reserve so many CPUs for a year. But that also means that you are stuck for the full year with those reserved instances, and there is no real elasticity.

      [Fernando Harald Barreiro Megino] 11:32:38
      Then there is on-demand, which is the list price for the on-demand virtual machines: you request a virtual machine when you want one, and once you have it, you can keep it for as long as you like. And then the tier below on-demand is spot,

      [Fernando Harald Barreiro Megino] 11:32:58
      where you request a virtual machine and you can get it, but

      [Fernando Harald Barreiro Megino] 11:33:06
      It can also be taken away from you whenever Google needs it.

      [Fernando Harald Barreiro Megino] 11:33:09
      for someone else, or it can just be evicted and put somewhere else if they need to do some optimization within their computing center.

      [Fernando Harald Barreiro Megino] 11:33:24
      Yeah — I think it's a 30-second notice. So it's not

      [Fernando Harald Barreiro Megino] 11:33:31
      It's nothing, but we can in practice work with.

      [Fernando Harald Barreiro Megino] 11:33:34
      So if you get the kill signal, you lose whatever was running on the VM

      [Fernando Harald Barreiro Megino] 11:33:40
      Until then.

      [Fernando Harald Barreiro Megino] 11:33:44
      But the experience with spot is quite good, in my opinion, because Google in particular previously had only the preemptible VMs, which had a maximum lifetime of 24

      [Fernando Harald Barreiro Megino] 11:33:59
      hours. Now they've stopped that model and moved to spot, and there you can have the virtual machines for a long time,

      [Fernando Harald Barreiro Megino] 11:34:07
      and I don't see a significant amount of failed or wasted wall-clock time because of spot preemptions.

      [Fernando Harald Barreiro Megino] 11:34:16
      Well, you will see it later, but it's like 60% cheaper than on-demand.
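
      A sketch of how a payload can watch for the spot kill signal on GCE: the documented instance metadata server flips a "preempted" flag when the VM is being reclaimed. The checkpoint step at the end is a placeholder:

          import time, requests

          URL = ("http://metadata.google.internal/computeMetadata/v1/"
                 "instance/preempted")

          def preempted() -> bool:
              # GCE metadata endpoint returns "TRUE" once preemption starts.
              r = requests.get(URL, headers={"Metadata-Flavor": "Google"}, timeout=2)
              return r.text.strip() == "TRUE"

          while not preempted():
              time.sleep(5)
          print("preemption notice received; checkpoint or drain the payload here")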

      [Fernando Harald Barreiro Megino] 11:34:22
      Then for the storage you also have different categories.

      [Fernando Harald Barreiro Megino] 11:34:27
      There are Standard, Nearline, and Coldline. The time to access your files

      [Fernando Harald Barreiro Megino] 11:34:33
      is the same with all of them, but the

      [Fernando Harald Barreiro Megino] 11:34:38
      further you go to the right, the longer you need to keep the data on the storage. So I think for Nearline you need to keep it 30 days; Coldline,

      [Fernando Harald Barreiro Megino] 11:34:51
      I don't know how many days.

      [Fernando Harald Barreiro Megino] 11:34:52
      Also, the colder the class, the more you pay for the access.

      [Fernando Harald Barreiro Megino] 11:34:58
      So in practice, for the storage we are always using the Standard class.
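
      As an illustration of the storage-class choice, a minimal sketch with the google-cloud-storage client; the project and bucket names are placeholders:

          from google.cloud import storage

          client = storage.Client(project="example-project")
          bucket = client.bucket("example-atlas-datadisk")
          # Standard has no retrieval fee or minimum retention; colder classes
          # (NEARLINE / COLDLINE / ARCHIVE) cost less at rest but add retrieval
          # charges and minimum storage durations (e.g. 30 days for Nearline).
          bucket.storage_class = "STANDARD"
          # Optional: demote objects untouched for a year to a colder class.
          bucket.add_lifecycle_set_storage_class_rule("NEARLINE", age=365)
          client.create_bucket(bucket, location="europe-west1")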

      [Fernando Harald Barreiro Megino] 11:35:05
      So that's the traditional cost model. Now, there is a new cost model,

      [Fernando Harald Barreiro Megino] 11:35:09
      which is what we are using, and it's this US public-sector subscription agreement with Google. Basically, if you are a university or a lab from the public sector, you can negotiate a fixed price for your computing needs. So the agreement is

      [Fernando Harald Barreiro Megino] 11:35:31
      for 10,000 virtual CPUs in total and 7 petabytes of storage, and it's $600,000 for 15 months,

      [Fernando Harald Barreiro Megino] 11:35:48
      and you pay that amount and you don't need to worry

      [Fernando Harald Barreiro Megino] 11:35:51
      if you then have more egress or less. You should optimize your agreement, obviously,

      [Fernando Harald Barreiro Megino] 11:35:58
      and this protects you from surprises. At the end of your 15 months,

      [Fernando Harald Barreiro Megino] 11:36:04
      I guess there will be a review — like, your egress is out of control, try to lower it, or maybe they are happy with the situation — and it gets renegotiated. And I don't want to talk about the exact amount of dollars that we have in our agreement,

      [Fernando Harald Barreiro Megino] 11:36:24
      but I just want to say that it's very favorable, and it's lower than the list prices. The other thing to consider in these clouds is that the resources are very elastic.

      [Fernando Harald Barreiro Megino] 11:36:41
      It's a bit what I tried to show with the Dask example: the cost

      [Fernando Harald Barreiro Megino] 11:36:45
      for 10,000 CPUs for 1 h is the same as the cost of one CPU

      [Fernando Harald Barreiro Megino] 11:36:50
      for 10,000 h. So you can run really elastically without a major cost increase. And also, since it's elastic, if you ramp down you don't need to keep any idle resources — you really just pay for what you use. And from an operational perspective,
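
      The elasticity point as toy arithmetic; the hourly rate is a made-up placeholder, not a real Google or Amazon price:

          RATE = 0.01                            # $ per vCPU-hour (placeholder)
          burst   = 10_000 * 1      * RATE       # 10,000 CPUs for 1 hour
          trickle = 1      * 10_000 * RATE       # 1 CPU for 10,000 hours
          assert burst == trickle                # same spend, ~10,000x faster turnaround
          print(f"either way: ${burst:.2f}")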

      [Fernando Harald Barreiro Megino] 11:37:17
      in my opinion, it's very low cost. I mean, the whole development, setup, and operation of the Rucio part was done by one of the Rucio experts, and also all of the development, setup, and operation for the PanDA part by one PanDA expert — fractions of

      [Fernando Harald Barreiro Megino] 11:37:38
      their time. And I also think that this model is really pure DevOps, in its most pure form.

      [Fernando Harald Barreiro Megino] 11:37:49
      You operate a site, you see — and you also learn things that are not good for the site, that you can improve in PanDA or in Harvester.

      [Fernando Harald Barreiro Megino] 11:37:59
      And then you go and change those things. And also, with the same amount of FTE

      [Fernando Harald Barreiro Megino] 11:38:08
      resources — like, if I run a 10,000-core cluster or a 30,000-core cluster, that doesn't really make a difference.

      [Fernando Harald Barreiro Megino] 11:38:18
      I'm now moving to the plot on the right. So here, what I'm showing is: all of the bins except the last one to the right are simulations using the cost calculator.

      [Fernando Harald Barreiro Megino] 11:38:34
      The first ones are on Amazon, the second ones are on Google.

      [Fernando Harald Barreiro Megino] 11:38:38
      And I took an average of the US ATLAS Tier-2s, so that on average it's 10,000 virtual CPUs

      [Fernando Harald Barreiro Megino] 11:38:48
      of cores, 7 petabytes of storage,

      [Fernando Harald Barreiro Megino] 11:38:55
      I don't know exactly what keeps being written, and then also, on average,

      [Fernando Harald Barreiro Megino] 11:39:04
      there were 1.5 petabytes of egress per month.

      [Fernando Harald Barreiro Megino] 11:39:06
      I looked it up in the DDM dashboard, and then I went to the Google price calculator, and using different types of VMs

      [Fernando Harald Barreiro Megino] 11:39:17
      I calculated the cost. The blue part is the CPU,

      [Fernando Harald Barreiro Megino] 11:39:21
      the red part is the storage — 7 petabytes — and the yellow part is the 1.5 PB of egress per month. And then, depending on the type of compute you use, you can reduce what you pay. So the first one is on-

      [Fernando Harald Barreiro Megino] 11:39:38
      demand; the second one is if you pay one year upfront on Amazon; then if you reserve for a year on Amazon

      [Fernando Harald Barreiro Megino] 11:39:46
      But you don't pay upfront, and it's a little bit more expensive.

      [Fernando Harald Barreiro Megino] 11:39:49
      Then you reserve for three years, and then you see that the price starts dropping considerably. And the last one for Amazon is Amazon spot, and you see that the CPU part is really much lower than the first bar, which is Amazon on-demand.

      [Fernando Harald Barreiro Megino] 11:40:08
      Then, if we move to the Google part: Google is a little bit cheaper for the CPU, at least for the calculations that I did; the egress and the storage are more or less the same as Amazon. And then for the very last bin, I took

      [Fernando Harald Barreiro Megino] 11:40:30
      the billing report of the Google Cloud console for the last 30 days and extracted how much we have been spending on each one of the things, to compare it with what I had done in my theoretical calculations. The CPU was a little bit cheaper — so

      [Fernando Harald Barreiro Megino] 11:40:50
      we use spot, so you have to compare it with the GCP

      [Fernando Harald Barreiro Megino] 11:40:53
      spot bar; it's a little bit cheaper. Also, I didn't use the full 10,000 CPUs, but only around 9,200. The storage is much cheaper than the others, but that's also because we don't have the 7 petabytes of data

      [Fernando Harald Barreiro Megino] 11:41:12
      yet; we have only 1.6 petabytes. So that explains it.

      [Fernando Harald Barreiro Megino] 11:41:16
      And then egress: we did 1.2 petabytes of egress according to the GCP billing, which is very close to what I had gotten in my model.

      [Fernando Harald Barreiro Megino] 11:41:28
      So that's what you would be paying if you would pay list prices.

      [Fernando Harald Barreiro Megino] 11:41:33
      But again, with our subscription agreement, what we effectively keep paying is lower than that. And this is it for this slide.
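
      The structure of this back-of-the-envelope list-price estimate can be written out as follows; all three unit prices are placeholders, not Google's or Amazon's actual rates, so current calculator values should be plugged in before relying on it:

          CPU_PRICE    = 0.02   # $ per vCPU-hour (placeholder)
          DISK_PRICE   = 0.02   # $ per GB-month, Standard storage (placeholder)
          EGRESS_PRICE = 0.08   # $ per GB of internet egress (placeholder)

          vcpus      = 10_000
          storage_gb = 7_000_000    # 7 PB
          egress_gb  = 1_500_000    # 1.5 PB per month

          monthly = (vcpus * 730 * CPU_PRICE          # ~730 hours in a month
                     + storage_gb * DISK_PRICE
                     + egress_gb * EGRESS_PRICE)
          print(f"estimated list-price cost: ${monthly:,.0f}/month")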

      [Fernando Harald Barreiro Megino] 11:41:46
      I see there are questions. Yeah.

      [Paolo Calafiura (he)] 11:41:52
      Quick question — actually, it's a comment and then a question.

      [Paolo Calafiura (he)] 11:41:57
      The subscription price is, of course, very advantageous, but it does

      [Paolo Calafiura (he)] 11:42:03
      kind of remove the elasticity you mentioned, because if you use one CPU for 10,000 h, you are not using your subscription very well at all. That was the only comment that I wanted to make. And the follow-up question is: has anyone talked with

      [Fernando Harald Barreiro Megino] 11:42:16
      Hmm.

      [Paolo Calafiura (he)] 11:42:22
      Amazon about a model similar to this Google subscription?

      [Fernando Harald Barreiro Megino] 11:42:28
      So, about elasticity: not completely, because the agreement is for 10,000 virtual CPUs on average. So you could be using 5,000 one month and 15,000 the next month. But yeah,

      [Fernando Harald Barreiro Megino] 11:42:46
      if you arrive on the last day and want to use your average of 10,000 virtual CPUs times 15 months on the last day, that will be very difficult.

      [Fernando Harald Barreiro Megino] 11:42:57
      But you can zoom up and down with your resources, so there is some elasticity. And about the Amazon question — I don't know.

      [Paolo Calafiura (he)] 11:42:57
      yeah, I did.

      [Kaushik De] 11:43:19
      Yeah, we have not had that conversation with Amazon.

      [Kaushik De] 11:43:24
      We only use credits in the old traditional way.

      [Kaushik De] 11:43:33
      So in some sense it's good, because we have the side-by-side comparison with Amazon via the old-style fixed credits.

      [Chris Hollowell] 11:43:56
      Yes — you know, from my experience, a lot of the cloud providers,

      [Chris Hollowell] 11:44:01
      They're not really guaranteeing a specific Cpu model.

      [Chris Hollowell] 11:44:05
      It's sort of nebulous what CPU they provide.

      [Chris Hollowell] 11:44:09
      So I mean, I guess the question is: you have 10,000 cores — of what?

      [Fernando Harald Barreiro Megino] 11:44:21
      They do not tell you exactly what the CPU model is, but some family.

      [Fernando Harald Barreiro Megino] 11:44:31
      So, for example, I used the N2, and that is Cascade Lake or Ice Lake, I think — I'm not the CPU expert — but those are, for Google, the newer generations.

      [Fernando Harald Barreiro Megino] 11:44:44
      And if you take the N1, you go to the older generations.

      [Fernando Harald Barreiro Megino] 11:44:46
      So yeah, you're more or less right:

      [Fernando Harald Barreiro Megino] 11:44:55
      you don't know exactly what the CPU is. It's an approximation.

      [Steven Timm] 11:44:59
      Oh, you think hmm

      [Enrico Fermi Institute] 11:45:00
      Do they not expose anything in the OS?

      [Enrico Fermi Institute] 11:45:05
      Okay.

      [Steven Timm] 11:45:06
      Yeah — I had my student actually run benchmarks for most of the new Google instances this summer.

      [Steven Timm] 11:45:18
      I have the numbers; we've got most of the Google specs available.

      [Steven Timm] 11:45:22
      I think we want some

      [Fernando Harald Barreiro Megino] 11:45:24
      I would be interested in having that

      [Chris Hollowell] 11:45:26
      right, right.

      [Steven Timm] 11:45:37
      okay.

      [Chris Hollowell] 11:45:39
      I guess the issue there, though, is: since you're not guaranteed any CPU model

      [Chris Hollowell] 11:45:44
      in particular, that could change.

      [Enrico Fermi Institute] 11:45:57
      Here we had a comment from Dirk

      [Dirk] 11:46:04
      One was about the elasticity, which was already covered — so it seems to be possible, within limits.

      [Dirk] 11:46:12
      But probably, if you have a 10,000 average, you can't run 120,000 for one month and nothing the rest of the year.

      [Enrico Fermi Institute] 11:46:16
      Thanks.

      [Dirk] 11:46:18
      That's probably not gonna fly. Let's see.

      [Dirk] 11:46:22
      Seems to me. But the other one was — I mean, we talked about these pricing plots a bit.

      [Dirk] 11:46:28
      I think I finally understood what that last bar means.

      [Dirk] 11:46:35
      So that's, from within the subscription, the running counter inside.

      [Dirk] 11:46:40
      So it's in some sense fake pricing, right?

      [Dirk] 11:46:42
      Because you pay the subscription price, but they still tabulate what things cost?

      [Fernando Harald Barreiro Megino] 11:46:48
      Yeah. So with the subscription, what they are doing is all the time filling up our credits,

      [Dirk] 11:47:01
      Quote unquote.

      [Dirk] 11:47:01
      Okay.

      [Ian Fisk] 11:47:11
      Yup — mine was a question, actually. It is interesting to see the various models between the two big cloud providers.

      [Ian Fisk] 11:47:19
      Has anyone done an updated calculation of what it's actually costing us to host these things?

      [Ian Fisk] 11:47:24
      Because I'm looking at these numbers, and I know the size of the facility that we run versus the cost of the hosting and the operations, and these numbers are dramatically higher than what we're paying.

      [Ian Fisk] 11:47:44
      So what am I putting in that price, right —

      [Ian Fisk] 11:47:49
      Okay, I am including in that price the cost of the hosting:

      [Ian Fisk] 11:47:53
      so what it costs to rent the space, to power the machines, to buy the machines, to operate the machines, to administer the machines, and to support the people using the machines.

      [Fernando Harald Barreiro Megino] 11:48:07
      But then everything that is installed on top is not included in this.

      [Ian Fisk] 11:48:11
      Oh, really? What do you mean — what's installed on top?

      [Fernando Harald Barreiro Megino] 11:48:16
      All the services that you are running

      [Ian Fisk] 11:48:17
      I am including all of that — I mean, services like the batch system and so on. I'm including all of those things,

      [Ian Fisk] 11:48:28
      so I'm including the 15-person staff that runs the place, plus the cost of the hosting facilities, plus the cost of operating the storage, external networking, etc.

      [Fernando Harald Barreiro Megino] 11:48:40
      I mean the list prices in particular, if you go on-demand.

      [Fernando Harald Barreiro Megino] 11:48:45
      I've been told not to compare — I compared it myself with a US ATLAS Tier-2, and if you use on-demand instances it's considerably higher. Our subscription agreement is very similar to a US ATLAS Tier-2, without saying how much a Tier-2 costs.

      [Ian Fisk] 11:49:01
      Right.

      [Fernando Harald Barreiro Megino] 11:49:17
      Because it creates conflicts and fights.

      [Enrico Fermi Institute] 11:49:26
      No, but as a quality of service — if you're going to compare it to a Tier-2, right, is the quality of service that you have to provide for the storage

      [Enrico Fermi Institute] 11:49:38
      the same on Google Cloud as it is for the Midwest Tier-2, for example?

      [Fernando Harald Barreiro Megino] 11:49:45
      I mean

      [Enrico Fermi Institute] 11:49:46
      Because that has an operational effect.

      [Fernando Harald Barreiro Megino] 11:49:51
      I mean, my opinion is that the quality of service in Google is — I mean, they simply have thousands of —

      [Enrico Fermi Institute] 11:50:02
      No, I mean the ATLAS services — the ATLAS services that are running there.

      [Fernando Harald Barreiro Megino] 11:50:02
      So the quality of — yeah.

      [Enrico Fermi Institute] 11:50:08
      That run at Google, right? Can it, for instance, be a nucleus site, so that it can serve out data and stuff like that?

      [Enrico Fermi Institute] 11:50:13
      That's what I mean by quality of service not their underlying layer.

      [Enrico Fermi Institute] 11:50:16
      That's all good. What I really mean is the WLCG layer of services and code that has to run on it to have it

      [Fernando Harald Barreiro Megino] 11:50:16
      So

      [Enrico Fermi Institute] 11:50:27
      behave like a typical Tier-2 grid site.

      [Fernando Harald Barreiro Megino] 11:50:30
      I've been working on the PanDA site, and that works as well as any ATLAS Tier-1 or Tier-2, I mean.

      [Fernando Harald Barreiro Megino] 11:50:40
      I don't know if it's better, because I don't look that much

      [Fernando Harald Barreiro Megino] 11:50:43
      at other sites. But it's completely flat, so there are never wasted cores.

      [Fernando Harald Barreiro Megino] 11:50:49
      The failure rate is very, very low, and when there is a failure rate, it's usually —

      [Fernando Harald Barreiro Megino] 11:50:57
      it's usually caused by misconfiguration. Just that.

      [Fernando Harald Barreiro Megino] 11:51:01
      I don't have the long history — it's new; we've been running it for a month.

      [Fernando Harald Barreiro Megino] 11:51:07
      And, for example, I underestimated the disk — things like that. Give this another half a year, and in my opinion the PanDA queue will run as well or better.

      [Ian Fisk] 11:51:21
      I guess I would just go back to my point, which I think is important for the report to follow, which is that we're always in a situation where we're making a choice in terms of how we allocate the resources, and we're always having to cut back on something else to afford

      [Ian Fisk] 11:51:33
      it. So in some sense, at some point we're going to have to make an argument that says using the cloud is less expensive by some metric.

      [Enrico Fermi Institute] 11:51:53
      Kaushik has had his hand raised for a while — jump in.

      [Kaushik De] 11:51:57
      yeah. So I wanted to address 2 of the points that we have had extensive discussion on.

      [Kaushik De] 11:52:07
      One is the elasticity: they don't seem to care.

      [Kaushik De] 11:52:11
      They're perfectly fine if you want to use 100,000 cores for one month instead of 10,000 cores for the duration of the project. We are planning to test that when we move to the later parts of our planned program of work and R&D studies with Google. But we certainly

      [Kaushik De] 11:52:30
      plan to test both models. The only reason we started with the flat model is that that's what our current computing systems are designed for, and we wanted to give that a quick test.

      [Enrico Fermi Institute] 11:52:39
      Okay.

      [Kaushik De] 11:52:48
      We don't have to continue this way: we could run nothing for three months, and then we could run five times higher for a month.

      [Kaushik De] 11:52:56
      It's it's completely elastic up to the the limits of the resources of the data center.

      [Kaushik De] 11:53:03
      And then, of course, one can scale up by going to multiple data centers.

      [Kaushik De] 11:53:06
      So that's the elasticity issue. Even with the subscription model, because we have discussed it with them.

      [Kaushik De] 11:53:14
      I think the cost-comparison issue is an important one, but I think we have to be a little bit careful, because we will never come to a conclusion if we ask Google, or the team that's using Google, to come up with the cost of

      [Kaushik De] 11:53:31
      a Tier-1 or a Tier-2 site. I mean, that just will never work.

      [Kaushik De] 11:53:35
      You know that it will never work, because every time somebody from the outside tries to evaluate the cost of

      [Enrico Fermi Institute] 11:53:38
      Me.

      [Kaushik De] 11:53:42
      a Tier-2 site, there will be something that people will find to say that it was not done correctly.

      [Kaushik De] 11:53:53
      So I think it's the Tier-1 and Tier-2 sites who actually, truly have to do the costing, and they actually have to do the comparison.

      [Kaushik De] 11:54:03
      And they actually have to decide what is best for them: to have on-prem resources or off-prem resources,

      [Kaushik De] 11:54:09
      And in what particular combination do they want to do it?

      [Kaushik De] 11:54:12
      I think it's up to the Tier-1 and Tier-2 sites.

      [Kaushik De] 11:54:16
      It's not up to the people who are using Google and Amazon, and it's certainly not up to the salespeople from Google and Amazon to tell us that they can do it cheaper. All

      [Kaushik De] 11:54:25
      we can do — and I think that's what we are focused on doing,

      [Kaushik De] 11:54:28
      and I think that's really what Fernando laid out in these plots —

      [Kaushik De] 11:54:32
      is to say what the cost of doing this and that on Google and Amazon is.

      [Kaushik De] 11:54:39
      And I think that's how we make progress: we are as transparent as possible, with as many different kinds of tests as possible.

      [Kaushik De] 11:54:48
      We explore all the possibilities that we can.

      [Kaushik De] 11:54:54
      And then, as experimentalists, we do that through this project over the next 15 months, and then we provide that information. Then it is up to Tier-1s, Tier-

      [Kaushik De] 11:55:05
      2s, and people of various kinds to come and argue this way and that way — and I don't think we, as a technical group, should be part of that.

      [Enrico Fermi Institute] 11:55:15
No, but you have to add a caution there about lost capabilities.

      [Enrico Fermi Institute] 11:55:23
I'll use an example: as a lot of the engineering was pulled out of the physics departments and went to the national labs, university groups lost capabilities.

      [Enrico Fermi Institute] 11:55:34
      They couldn't do certain things on detector projects. This will be the same.

      [Enrico Fermi Institute] 11:55:39
We have to quantify that effect. If you were to, for instance, move all the compute to the cloud, what would we lose?

      [Kaushik De] 11:55:46
I completely agree with you, but those are not part of the

      [Kaushik De] 11:55:50
technical study of what we can do on Google and Amazon.

      [Kaushik De] 11:55:54
      Those are really discussions within the field of how we move our field forward.

      [Kaushik De] 11:55:59
I think we should separate the two. I don't think we should mix up the two.

      [Kaushik De] 11:56:02
      I think we should look at the quality of service. I think we should look at the type of service.

      [Kaushik De] 11:56:07
I think we should look at the services that the clouds actually provide, and we look at the costs.

      [Kaushik De] 11:56:17
That's the scope of what we're doing. Beyond that,

      [Kaushik De] 11:56:21
of course, it is up to the field to decide.

      [Enrico Fermi Institute] 11:56:24
But even in the technical cost exercise, because we provide labor to it,

      [Enrico Fermi Institute] 11:56:29
don't we also have to capture the labor needed to have the same quality of service, as seen from the experiment, as a typical Tier 2?

      [Eric Lancon] 11:56:51
Yes, I wanted to come back on a few statements which were made. I think we need to be very careful about general statements like "it's cheaper than a Tier 2." Those statements

      [Eric Lancon] 11:57:15
do not represent US ATLAS, and it should be indicated on the slides if there are such statements there. There is a working group within ATLAS being set up specifically to look at the TCO of operating on the cloud versus at a Tier 2, so we may want

      [Eric Lancon] 11:57:33
to wait for the conclusions of this working group. What I would like to say is that we are

      [Eric Lancon] 11:57:42
very well aware of the cost of cloud compared to on-site operation, because for any big investment we perform a comparison of the costs,

      [Eric Lancon] 11:57:55
including on the cloud, including the Google discount which is being used by ATLAS.

      [Eric Lancon] 11:58:05
And we have found, as was noted by Ian Fisk,

      [Eric Lancon] 11:58:12
the costs really prohibitive. I cannot give you exact numbers, because we cannot disclose the actual costs.

      [Eric Lancon] 11:58:23
But on-site it is really much lower than any solution which is available on the cloud.

      [Fernando Harald Barreiro Megino] 11:58:43
Regarding your first comment: I didn't hear anyone saying that this is cheaper than a US Tier 2.

      [Fernando Harald Barreiro Megino] 11:58:52
I don't know where you got that. What I said is that it gets similar with the subscription model.

      [Enrico Fermi Institute] 11:59:01
      Okay.

      [Fernando Harald Barreiro Megino] 11:59:05
Well, okay. In any case, I explicitly didn't put a US dollar cost, and for the TCO,

      [Fernando Harald Barreiro Megino] 11:59:13
it's what I said at the very beginning about the TCO.

      [Paolo Calafiura (he)] 11:59:27
Yeah, I want to... oh, sorry, me.

      [Paolo Calafiura (he)] 11:59:35
I didn't see the raised hands. Can I go?

      [Paolo Calafiura (he)] 11:59:39
I apologize. I want to make a comment, which is that one thing we have to keep in mind is the derivative.

      [Paolo Calafiura (he)] 11:59:50
The comparison of the cost of cloud to our own resources was not even close the first time we did it, which was about 2016.

      [Paolo Calafiura (he)] 12:00:04
I mean, it was like an order of magnitude more expensive.

      [Paolo Calafiura (he)] 12:00:07
And while I agree with Eric that the cost comparison is not yet done,

      [Paolo Calafiura (he)] 12:00:14
now it's actually worth doing the cost comparison, and probably it will come

      [Paolo Calafiura (he)] 12:00:20
out still more expensive on the cloud side than the on-premises side, but not by a factor of 10.

      [Paolo Calafiura (he)] 12:00:27
So I think one of the important roles of these investigations is to be ready in case, for some reason, Google Cloud or AWS can one day buy CPU and storage at prices we don't

      [Paolo Calafiura (he)] 12:00:44
have access to. So let's not just see it as a short-term effort,

      [Paolo Calafiura (he)] 12:00:52
but as an effort which is thinking about what's going to happen in 5 years.

      [Eric Lancon] 12:00:55
No, I agree, Paolo. We should keep a close eye on the cost, and if, for an equivalent level of service, the services are cheaper on the cloud, we should consider going to a cloud solution for some of the applications.

      [Steven Timm] 12:01:25
So I have a couple of comments. One is that

      [Steven Timm] 12:01:29
2-3 years ago, just before Covid, there was a very big study done at Fermilab.

      [Steven Timm] 12:01:35
It asked what it would cost to run the Rubin data center here as opposed to running it on the cloud.

      [Steven Timm] 12:01:42
We tried to cost that all out. I do not know all the exact numbers, but there was a very comprehensive study done, and that's a data point.

      [Steven Timm] 12:01:51
You may be familiar with it already; someone could probably get that for you.

      [Steven Timm] 12:02:00
If anybody wants to add, they can come in here.

      [Ian Fisk] 12:02:01
I think those numbers are public, if people want to see them.

      [Steven Timm] 12:02:05
      Okay? Yeah, huh? Right

      [Steven Timm] 12:02:13
Great. The other thing that we've noticed, from 6 years ago when we first did the big CMS

      [Steven Timm] 12:02:21
demo on Amazon until now, is that spot pricing has gone up by about a factor of 2

      [Steven Timm] 12:02:26
on Amazon. In 2016 you could run at 25% of the on-demand price.

      [Steven Timm] 12:02:33
You can't do that anymore and get any cycles; that's of interest, I think.

      [Steven Timm] 12:02:38
And the third thing, as far as costing what it costs to run here as opposed to on the cloud: we've been through many estimates of that, some disagreeing by a factor of 4; there are always going to be differences. But sooner or later

      [Steven Timm] 12:02:59
we're going to go to DOE and say we need more money for more computing,

      [Steven Timm] 12:03:03
or we need more money for another building, and we're not going to get it. So there will be a limit to how much we can put on a site.

      [Steven Timm] 12:03:10
And that may eventually be the driver for why we need to go to the cloud.

      [Enrico Fermi Institute] 12:03:22
      Okay, thanks, Keith. We'll go to Tony.

      [Fernando Harald Barreiro Megino] 12:03:34
This is using the subscription cost; I mean, it covers all of the above except the last one.

      [Fernando Harald Barreiro Megino] 12:03:41
It's 10,000 virtual CPUs where you run

      [Fernando Harald Barreiro Megino] 12:03:44
whatever you want, 7 petabytes of standard object store, and 1.5 petabytes of egress per month.

      [Fernando Harald Barreiro Megino] 12:03:52
It supports whatever you're using it for; it's not strictly tied to simulation.

      [Enrico Fermi Institute] 12:04:03
Can I ask a question? Go ahead. Do you have the hooks in place, in PanDA, to

      [Enrico Fermi Institute] 12:04:09
capture what CPU model the job reports? Because

      [Enrico Fermi Institute] 12:04:16
then we can turn around and figure out, for the number of virtual CPUs you've used over some period of time,

      [Enrico Fermi Institute] 12:04:22
what the HEPspec06 equivalent is. Then one can compare it; at least we know what the Tier 2s provide in terms of HS0

      [Enrico Fermi Institute] 12:04:36
      6,

      [Fernando Harald Barreiro Megino] 12:04:38
So I have not looked into that, but for most of the grid sites it gets reported back.

      [Fernando Harald Barreiro Megino] 12:04:50
The pilot looks for that information and reports it back.

      [Enrico Fermi Institute] 12:04:52
      Okay.

      [Enrico Fermi Institute] 12:05:01
Then it might be very interesting to compare that to benchmark jobs, taken with enough spread that you get the distribution of what they're actually giving you. Because at the end of the day we get paid in US dollars per HS0

      [Enrico Fermi Institute] 12:05:23
      6, okay.
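As a rough illustration of the accounting being discussed here, a minimal sketch in Python, with assumed benchmark numbers (real HS06-per-core factors would come from benchmarking each cloud instance type), of turning the pilot-reported CPU model and vCPU-hours into an HS06-hour figure:

    # Hypothetical HS06-per-core factors for CPU models as reported by the pilot.
    # The numbers below are placeholders; real values would come from running
    # the HS06/HEPscore benchmark on each instance type with enough spread.
    HS06_PER_CORE = {
        "Intel(R) Xeon(R) CPU @ 2.20GHz": 10.0,  # assumed value
        "AMD EPYC 7B12": 13.0,                   # assumed value
    }

    def hs06_hours(cpu_model, vcpus, wall_hours):
        """Integrate a job's vCPU usage into HS06-hours via the benchmark table."""
        return HS06_PER_CORE[cpu_model] * vcpus * wall_hours

    # Example: an 8-vCPU job that ran for 12 hours on the Xeon type.
    print(hs06_hours("Intel(R) Xeon(R) CPU @ 2.20GHz", 8, 12.0))  # 960.0 HS06-hours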

      [Enrico Fermi Institute] 12:05:31
      Comment from Ian

      [Ian Fisk] 12:05:34
I thought Steven was done, but I just wanted to go back to

      [Ian Fisk] 12:05:44
this point about cost. One thing that we need to assess as a field is what economics have changed: every time we do this evaluation we find that things are a little bit closer to being competitive, and at some point maybe they will make the transition over.

      [Ian Fisk] 12:05:59
But something has to happen, which is basically that the economy of scale associated with AWS and Google has to be so large

      [Ian Fisk] 12:06:09
that they can do it cheaper than we can and still make money. Whether that's a colocation facility, or the fact that we use our resources only a fraction of the time, or whatever, something has to change. Because at the end of the day, for the

      [Ian Fisk] 12:06:24
      same reason I don't drive a rental car to work.

      [Ian Fisk] 12:06:28
If you have a facility which you're using all of the time, that you operate yourself,

      [Ian Fisk] 12:06:33
it's very hard for someone to undercut you, unless

      [Ian Fisk] 12:06:38
they're so large, or so cheap, or they build in cheaper places. We must be able to identify the thing that is going to make it competitive.

      [Gonzalo Merino] 12:06:58
A brief comment: I just wanted to subscribe to a previous comment from Kaushik. I must say I'm a little bit surprised about all this discussion of whether it is cheaper or more expensive. WLCG has like 170 sites,

      [Gonzalo Merino] 12:07:17
so the answer will be totally different for each of those sites.

      [Gonzalo Merino] 12:07:20
So I think the argument Kaushik was making, which I totally subscribe to, is that the value in this exercise, or at least part of it, is: we need to get these numbers, like the ones Fernando showed; that's super useful.

      [Gonzalo Merino] 12:07:31
What's the cost of running this in a commercial cloud?

      [Gonzalo Merino] 12:07:35
And then it is for each of those 170 sites to take this number and compare it to their internal costing, which will be completely different.

      [Gonzalo Merino] 12:07:43
Depending on size, depending on country, the labor cost differs by factors

      [Gonzalo Merino] 12:07:49
between countries. So discussing whether it's more expensive or cheaper than this or that site is, I think, useless, whether it's Fermilab or a Tier 2 here in Czechoslovakia

      [Gonzalo Merino] 12:08:01
or in Spain. It's for each site in every country to take this number, compare it to their own cost, which they know,

      [Gonzalo Merino] 12:08:10
and then react accordingly. That's the value I see, like the rental car example.

      [Enrico Fermi Institute] 12:08:37
Shigeki.

      [Shigeki] 12:08:41
One comment I have is this: there's really no incentive for any of these cloud providers to become the lowest-cost provider. They're in the business to make money, right?

      [Shigeki] 12:08:54
They have hordes of accountants and supercomputers that are constantly hedging the cost of everything.

      [Shigeki] 12:09:03
Their business model is value-add, not dropping to the lowest-cost provider.

      [Shigeki] 12:09:11
      Right.

      [Dirk] 12:09:21
The power of competition; I mean, they are competing against each other.

      [Dirk] 12:09:29
      So

      [Enrico Fermi Institute] 12:09:34
      Yeah, yes.

      [Dirk] 12:09:36
I mean, isn't the saying that in any mature market

      [Dirk] 12:09:39
the price of a service will basically go down so that the profit approaches zero?

      [Enrico Fermi Institute] 12:09:48
Have you flown recently? That's a mature market, and the prices are going the other way.

      [Dirk] 12:09:56
Well, they slashed supply. That's the thing about these data centers.

      [Fernando Harald Barreiro Megino] 12:10:09
Okay, I think we can go then to the next slide, with Kenyi.

      [Kenyi Paolo Hurtado Anampa] 12:10:19
      okay. Yup

      [Kenyi Paolo Hurtado Anampa] 12:10:28
We'll take that as a no. Okay, moving on to the CMS experience.

      [Kenyi Paolo Hurtado Anampa] 12:10:33
I tried to summarize and put a few numbers there from these sources.

      [Kenyi Paolo Hurtado Anampa] 12:10:40
This is from one paper and some slides of the work

      [Kenyi Paolo Hurtado Anampa] 12:10:45
that was done 5-6 years ago on Amazon and Google Cloud.

      [Kenyi Paolo Hurtado Anampa] 12:10:49
Again, these numbers are not up to date; they are from 2016-2017.

      [Kenyi Paolo Hurtado Anampa] 12:10:57
Things have changed, but the high-level summary of the conclusion there is that the cost per core-hour for both AWS and Google Cloud was similar. The work on Amazon was run over the course of a few days, about 8 days, and you can see in

      [Kenyi Paolo Hurtado Anampa] 12:11:20
the top-right plot, in green, the production on AWS,

      [Kenyi Paolo Hurtado Anampa] 12:11:29
compared to what was kept on the Fermilab side; and then the bottom plot is what we have from Google Cloud.

      [Kenyi Paolo Hurtado Anampa] 12:11:42
      And the work on Google cloud was done over the course of about 4 days.

      [Kenyi Paolo Hurtado Anampa] 12:11:47
The goal was to double the size, in terms of total available cores,

      [Kenyi Paolo Hurtado Anampa] 12:11:57
with respect to what we had in the global pool. The demo was done using

      [Kenyi Paolo Hurtado Anampa] 12:12:04
production simulation workflows, and the on-premises

      [Kenyi Paolo Hurtado Anampa] 12:12:11
estimate that I put on the slides is the estimate from the paper. Again,

      [Kenyi Paolo Hurtado Anampa] 12:12:18
you have the sources linked from arXiv there.

      [Kenyi Paolo Hurtado Anampa] 12:12:25
And then the other factor: let me just focus on the operational effort.

      [Kenyi Paolo Hurtado Anampa] 12:12:32
For this I got input from the HEPCloud team, and the conclusion is that there was initial effort mostly related to monitoring.

      [Kenyi Paolo Hurtado Anampa] 12:12:47
This was to prevent waste of compute resources: to track stuck jobs or jobs going too slow, identify

      [Kenyi Paolo Hurtado Anampa] 12:12:59
problems, and identify huge log files that were causing high concurrent transfers.

      [Kenyi Paolo Hurtado Anampa] 12:13:06
After that, the ongoing maintenance is low in terms of effort, with an estimate of just one

      [Kenyi Paolo Hurtado Anampa] 12:13:17
person, part time, for occasional work. For example, the setup is still maintained up to today,

      [Kenyi Paolo Hurtado Anampa] 12:13:27
in case CMS wants to use it again, and a few months ago they worked on integrating support for ID tokens.

      [Kenyi Paolo Hurtado Anampa] 12:13:49
Alright. We have the last slide with, basically, strategy considerations and discussions; these are just some bullets.

      [Kenyi Paolo Hurtado Anampa] 12:14:01
We talked a lot about cloud costs already, and there are some other bullets there related to egress costs: what is the role

      [Kenyi Paolo Hurtado Anampa] 12:14:10
of the cloud in WLCG, discussions of how to make use of the cloud.

      [Kenyi Paolo Hurtado Anampa] 12:14:26
We are actually at the end of the schedule.

      [Kenyi Paolo Hurtado Anampa] 12:14:32
So it's lunch break, or... I don't know.

      [Fernando Harald Barreiro Megino] 12:14:38
We still have a little bit of time. On the cost,

      [Fernando Harald Barreiro Megino] 12:14:46
people discussed it a lot already, but this is the opportunity to discuss any other worries, for example egress costs or other worries about the cloud. Or are there any particular ideas of how we can make better use of the cloud, like exploiting elasticity,

      [Fernando Harald Barreiro Megino] 12:15:13
or using GPUs, or whatever else has been discussed, like Lancium?

      [Dirk] 12:15:32
We already talked about elasticity, and I just wanted to maybe focus on one of the points on the slide:

      [Dirk] 12:15:42
the different planning horizon versus our own equipment. That gives you a different layer of elasticity, because when you purchase equipment, it's not only that you have a certain number of deployed cores in your data center; when you purchase

      [Dirk] 12:15:58
the equipment you basically make a commitment for the next 3, 4, or 5 years,

      [Dirk] 12:16:04
whatever the retirement window is for hardware that you buy.

      [Dirk] 12:16:08
It's gone up a bit. With cloud, you don't have to make that commitment.

      [Dirk] 12:16:15
The thing is, though, in our science we usually have pretty stable workloads, so we can't really take full advantage of that.

      [Dirk] 12:16:23
Usually we buy equipment for 4 years, and we expect, year to

      [Dirk] 12:16:30
year, to always have the work to keep it busy. But looking out,

      [Dirk] 12:16:35
there's the dip in activity before the HL-LHC

      [Dirk] 12:16:42
comes up. I don't know if that's somewhere cloud maybe could help.

      [Dirk] 12:16:48
If at that point we were like 20% cloud, you could say: for the off years, the shutdown years, you just don't buy any cloud cycles.

      [Dirk] 12:16:59
I'm not sure how that would play with a subscription at renewal time; if you are in a subscription model, could you just skip a renewal and then resume a year

      [Dirk] 12:17:08
later? But that's a possibility, and you really don't have that with purchased equipment, because you continuously keep buying equipment

      [Dirk] 12:17:21
just so you don't have everything retired all at once.

      [Dirk] 12:17:24
You kind of cycle over your whole data center.

      [alexei klimentov] 12:17:41
I think this is a very simplistic approach.

      [Enrico Fermi Institute] 12:17:41
Go ahead, Alexei.

      [alexei klimentov] 12:17:48
I think, at least what we are trying to do in ATLAS:

      [alexei klimentov] 12:17:53
we are trying to integrate clouds in our computing model, and it is not as it was just described. I want to remind you that one of the first uses of clouds, at least which I remember, was done by the Belle experiment.

      [alexei klimentov] 12:18:14
Not by Belle II, but by Belle, when they needed to conduct a Monte Carlo campaign. The way they designed it,

      [alexei klimentov] 12:18:23
for them it was cheaper just to buy cycles and run this Monte Carlo campaign.

      [alexei klimentov] 12:18:32
So I think about just this comparison, and also what was mentioned before by several people:

      [alexei klimentov] 12:18:40
is the cloud a replacement for what we have?

      [alexei klimentov] 12:18:44
Of course not. It is not a replacement, but it is a resource which we can use, and elasticity, for me,

      [alexei klimentov] 12:18:52
is one of the main features which we can use. And as Paolo mentioned, before we go to purchase something new that we don't have now, we can try it in the cloud.

      [alexei klimentov] 12:19:07
I also kind of disagree with the statement that our workflows

      [alexei klimentov] 12:19:12
are very standard, or whatever word was used, because what we see even now, and I think it will

      [alexei klimentov] 12:19:19
continue in this direction, is that you have new, more complex workflows, which we, at least in ATLAS,

      [alexei klimentov] 12:19:29
did not have during Run 2, and for the high-luminosity LHC it will be more and more like that.

      [alexei klimentov] 12:19:33
That's why I think the problem is more complex, and we need to address it in a more complex way, and not, what I'm afraid of,

      [alexei klimentov] 12:19:45
start to split it into small pieces, because then, well, we all know.

      [Eric Lancon] 12:20:02
Yes, sorry. I do agree with Alexei that there are

      [Eric Lancon] 12:20:09
more complex workflows coming, and there is a need to adapt.

      [Eric Lancon] 12:20:13
What I don't fully follow is the conclusion that the cloud is best suited for this;

      [Eric Lancon] 12:20:22
the facilities need to work to adapt to the new requirements.

      [Eric Lancon] 12:20:28
And that's what makes the comparison in the end.

      [alexei klimentov] 12:20:46
To refine my comment: I fully agree with you, and that's why

      [alexei klimentov] 12:20:52
what we will try in the project, the full chain, for me

      [alexei klimentov] 12:21:00
is the big one, and the first thing also to try.

      [Enrico Fermi Institute] 12:21:29
One comment I had: we've spent a lot of time talking about how the clouds hook into the existing workflow systems, PanDA and whatnot.

      [Enrico Fermi Institute] 12:21:39
Does it make sense to further explore how clouds can either be used as analysis facilities or extend analysis facilities in some way? One of the things the users might want, for example, are exotic types

      [Enrico Fermi Institute] 12:22:01
of resources, or accelerators, GPUs, things like that.

      [Enrico Fermi Institute] 12:22:05
Can we use clouds to sort of pad out those kinds of resources at analysis facilities? Does it make sense to explore that?

      [Fernando Harald Barreiro Megino] 12:22:14
In all of the projects there is always a possibility for the user to get an account and really do whatever they need.

      [Fernando Harald Barreiro Megino] 12:22:31
If it's more of a central analysis facility,

      [Fernando Harald Barreiro Megino] 12:22:39
like the analysis facilities that we are usually talking about in ATLAS or CMS,

      [Fernando Harald Barreiro Megino] 12:22:46
for that there will also be, in the ATLAS project, an R&D effort

      [Fernando Harald Barreiro Megino] 12:22:51
to extend that, and some ideas to do that were presented in the last week or two.

      [Enrico Fermi Institute] 12:23:14
Is my mic working? So, something that was interesting, and I don't have it right at my fingertips:

      [Enrico Fermi Institute] 12:23:20
Purdue University actually got a pretty big grant from Google to set up a system where basically their batch system can burst into the Google cloud. They have all the VPNs and whatnot set up, and the images are the same images as their

      [Enrico Fermi Institute] 12:23:42
compute farm, and with the VPN setting up the networking, the cloud hardware is, quote, the same as the regular batch they have there. So, outside of latency or whatever, you can basically just

      [Enrico Fermi Institute] 12:23:57
submit to HTCondor, or, I think they run Slurm there, you could submit Slurm jobs and run whatever you want. So there's definitely work that's been done.
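A minimal sketch of the bursting mechanism described above, assuming the google-cloud-compute Python client and hypothetical project, zone, network, and image names; a Slurm power-save ResumeProgram could invoke something like this to start cloud worker nodes that boot the same image as the on-site farm (hostname parsing is simplified here to a comma-separated list, whereas Slurm actually passes a hostlist expression):

    import sys
    from google.cloud import compute_v1

    PROJECT, ZONE = "my-project", "us-central1-a"         # hypothetical names
    IMAGE = f"projects/{PROJECT}/global/images/wn-image"  # same image as on-site nodes

    def start_node(name):
        """Create one cloud worker node that joins the batch pool over the VPN."""
        instance = compute_v1.Instance(
            name=name,
            machine_type=f"zones/{ZONE}/machineTypes/n2-standard-8",
            disks=[compute_v1.AttachedDisk(
                boot=True, auto_delete=True,
                initialize_params=compute_v1.AttachedDiskInitializeParams(
                    source_image=IMAGE),
            )],
            network_interfaces=[compute_v1.NetworkInterface(
                network=f"projects/{PROJECT}/global/networks/batch-vpn")],
        )
        # insert() returns an operation; result() blocks until the VM exists.
        compute_v1.InstancesClient().insert(
            project=PROJECT, zone=ZONE, instance_resource=instance).result()

    if __name__ == "__main__":
        for hostname in sys.argv[1].split(","):
            start_node(hostname)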

      [Enrico Fermi Institute] 12:24:32
So maybe to bring up another topic from yesterday: we mentioned here a little bit about using cloud to run some kind of particular campaign or what have you. Does that have any effect on how we think about pledging clouds?

      [Enrico Fermi Institute] 12:24:53
And in general, are there any discussions we want to have about pledging clouds?

      [Enrico Fermi Institute] 12:25:04
Dirk, you want to jump in?

      [Dirk] 12:25:06
Yeah, I think the cloud fits into the discussion we had yesterday about pledging.

      [Dirk] 12:25:15
I think, under the current rules, to pledge a cloud you would have to pledge a certain minimum amount

      [Dirk] 12:25:22
of cores. So if you replicate a site where your business is to always keep, like, 4,000 cores running,

      [Enrico Fermi Institute] 12:25:23
      Yeah.

      [Dirk] 12:25:29
you could pledge the 4,000 cores, but you couldn't really take advantage of elasticity.

      [Dirk] 12:25:35
So you kind of would have to pledge the lower boundary, within some limits, because even grid sites are allowed to go below the floor for a limited amount of time, I think. But it puts limits on how flexibly you can use the

      [Dirk] 12:25:53
resources. It's the same problem we have with the scheduling on the HPCs:

      [Dirk] 12:25:57
you basically can't just keep it off for 11 months of the year and then use up everything in a month. That wouldn't work with how the pledges are structured right

      [Dirk] 12:26:08
now, and what the rules are.

      [Enrico Fermi Institute] 12:26:09
We pledge HS06, not cores.

      [Enrico Fermi Institute] 12:26:21
But the point is that we have to figure out, right,

      [Enrico Fermi Institute] 12:26:27
if you're going to even consider pledging cloud resources, how to put it in a unit that is consistent with what we have, so it's apples to apples.

      [Steven Timm] 12:26:59
      Yes, I was going back to the question of exotic resources.

      [Steven Timm] 12:27:04
The comment was made yesterday that the exotic resources, such as the P instances on Amazon, the FPGAs and the tensor things or whatever, are always the highest-priced things you can get. But you still have to weigh that against having them sit on

      [Steven Timm] 12:27:21
site, on premises, sitting there and sucking up power all the time

      [Steven Timm] 12:27:25
and not being used all the time. At least we don't yet have a steady

      [Steven Timm] 12:27:31
use case for GPUs or TensorFlow or FPGAs, or whatever.

      [Steven Timm] 12:27:37
So there is value there, and I've heard from management that they prefer that.

      [Bockelman, Brian] 12:28:08
Yeah, I just wanted to maybe tackle something

      [Bockelman, Brian] 12:28:13
that Dirk said a little differently. I'm worried less about the HEPspec0

      [Bockelman, Brian] 12:28:20
6 equivalent, and more about the fact that for cloud resources you probably need to pledge in HEPspec06-hours, right?

      [Bockelman, Brian] 12:28:30
It's the difference between kilowatts versus kilowatt-hours. Some aspect of the pledge,

      [Bockelman, Brian] 12:28:39
again going to the power grid analogy, needs to be in kilowatt-hours,

      [Bockelman, Brian] 12:28:45
and what the benchmark is, I think, is less important.

      [Bockelman, Brian] 12:28:49
But how do you come up with a proposal that balances the fact that you do need some base capacity, and that's important,

      [Bockelman, Brian] 12:28:59
with the fact that it's very unlikely that 100% of our hours need to be base capacity?

      [Bockelman, Brian] 12:29:06
So, some combination of kilowatts and kilowatt-hours, as an analogy, in our pledges.
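To put numbers on the kilowatt versus kilowatt-hour distinction, a small worked example in Python (all figures invented for illustration): a flat pledge and an elastic pledge can integrate to the same HS06-hours while looking very different instantaneously.

    HOURS_PER_YEAR = 24 * 365                  # 8760

    # Flat pledge: 10,000 HS06 held continuously (the "kilowatt" view).
    flat_total = 10_000 * HOURS_PER_YEAR       # 87.6M HS06-hours ("kilowatt-hours")

    # Same integrated capacity delivered elastically: a 2,000 HS06 base,
    # plus the remainder spent in two one-month campaigns.
    base_total = 2_000 * HOURS_PER_YEAR        # 17.52M HS06-hours
    burst_hours = 2 * 30 * 24                  # 1440 hours of campaign time
    burst_level = (flat_total - base_total) / burst_hours
    print(burst_level)  # ~48,667 HS06 of extra capacity during each campaign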

      [Johannes Elmsheuser] 12:29:20
Right, a follow-up comment to this: in the end the pledges are always, as you say, a unit per year, right?

      [Johannes Elmsheuser] 12:29:31
And we don't have a unique CPU architecture either, right?

      [Johannes Elmsheuser] 12:29:37
Over the years, with all the procurements, there are different kinds of CPU architectures.

      [Johannes Elmsheuser] 12:29:47
So, as was said before, we have more or less the same problem also on the grid.

      [Johannes Elmsheuser] 12:29:57
      We are also averaging there. So we don't have the same unit over and over at the same site.

      [Johannes Elmsheuser] 12:30:03
So in principle, on the cloud, we are solving the same problem.

      [Johannes Elmsheuser] 12:30:08
So I don't see this as really problematic in that sense, because we have done exactly the same thing for 10-15 years on the grid.

      [Bockelman, Brian] 12:30:18
Yep, I don't think I'm following, because what we pledge on the grid is a certain HEP

      [Bockelman, Brian] 12:30:27
spec06 capacity that is available starting at a given time period.

      [Bockelman, Brian] 12:30:33
Right, but it's...

      [Johannes Elmsheuser] 12:30:34
      Right? And that's for one year, right? It

      [Johannes Elmsheuser] 12:30:40
It's good for one year, and at the site you don't have a specific unit of one CPU, right?

      [Johannes Elmsheuser] 12:30:47
You always have an average, and that was the argument before.

      [Bockelman, Brian] 12:30:50
      Oh!

      [Bockelman, Brian] 12:30:55
No, no. But that's very different. It's not the average, right? Because

      [Bockelman, Brian] 12:31:01
I can't come in and give you 12 times as much capacity

      [Bockelman, Brian] 12:31:03
in January, and zero it out for the next 11 months.

      [Bockelman, Brian] 12:31:07
That is most definitely not what the MoUs say.

      [Bockelman, Brian] 12:31:12
They're very specific: a HEPspec06 count available, depending on whether you're Tier 1 or Tier 2,

      [Bockelman, Brian] 12:31:19
I forget what the number is, 85 or 95% of the time.

      [Ian Fisk] 12:31:27
      right.

      [Johannes Elmsheuser] 12:31:28
Sure, but I agree that you give an average, basically, over a certain time period.

      [Johannes Elmsheuser] 12:31:34
I think we agree here, right? And as you say, we then have to say: okay, you provided this

      [Johannes Elmsheuser] 12:31:41
for 4 months, or for 3 months, or something like this. And this is then the pledge.

      [Ian Fisk] 12:31:49
I guess I'd also like to argue that our pledging model, as it is right now, is probably not ideal. We have a model which is based on the fact that we have dedicated facilities that have been purchased, and the experiment's responsibility is to

      [Ian Fisk] 12:32:04
demonstrate that over the course of 12 months they can use them at some average rate; we both provision and schedule for average utilization. Whether it's HPC

      [Ian Fisk] 12:32:14
or clouds, there's an opportunity to not do that, and we might find as collaborations that the ability to schedule 5 times more for a period of a month, rather than holding it at a constant rate for a year, was actually a much more efficient use of people's

      [Ian Fisk] 12:32:32
time, and that our existing pledging model is sort of limiting.

      [Ian Fisk] 12:32:36
I believe Maria Girone, who's connected, presented this at CHEP Osaka,

      [Ian Fisk] 12:32:41
probably 6 years ago: the concept of scheduling for peak. It seems like, because we have dedicated resources, we have to show that they're well used.

      [Dirk] 12:33:25
Yeah, and maybe one complication with scheduling for peak:

      [Dirk] 12:33:30
you actually have to think about and justify what you want to use the peak for.

      [Dirk] 12:33:36
So it's more complicated to plan this; in steady state you just keep it busy.

      [Ian Fisk] 12:33:39
It is more complicated to plan, yes.

      [Ian Fisk] 12:33:44
It requires people to be better prepared.

      [Dirk] 12:33:47
      Yeah. But that's maybe why it hasn't happened yet.

      [Ian Fisk] 12:33:49
Right, but at the same time it would allow... imagine that a 6-month Monte Carlo campaign were a one-month Monte Carlo campaign, and then you'd have

      [Ian Fisk] 12:33:58
5 months where people have the complete set for analysis. That might be much more efficient.

      [Ian Fisk] 12:34:04
And that's also, I think, a motivation for why you might want to go to clouds, even if they were on paper more expensive, because you'd have to make some metric of how much of people's time you're saving.

      [Enrico Fermi Institute] 12:34:17
Whose time are you saying you're saving?

      [Ian Fisk] 12:34:22
I would claim the entire collaboration's time to physics, perhaps.

      [Enrico Fermi Institute] 12:34:23
Which people's time?

      [Enrico Fermi Institute] 12:34:34
      How do you accurately measure without drawing a false conclusion?

      [Ian Fisk] 12:34:40
I don't... I think it's difficult.

      [Ian Fisk] 12:34:42
I think it's probably somewhat difficult to measure the inefficiency that we have right now, but I think you can.

      [Enrico Fermi Institute] 12:34:48
      Okay.

      [Ian Fisk] 12:34:49
I think, without drawing a false conclusion, I can claim that the way it's set up right now is designed to optimize a specific thing, which is the utilization of particular resources,

      [Ian Fisk] 12:35:14
and I guess I'm claiming that's not the only thing.

      [Ian Fisk] 12:35:18
If I assume that's the most important thing because we spent all this money buying dedicated computers,

      [Ian Fisk] 12:35:23
that's a reasonable thing to say: we're not going to let these things sit idle,

      [Ian Fisk] 12:35:27
we're not going to over-provision. But it's very difficult to claim that the optimization designed to use this particular resource happens to also be exactly the perfect optimization

      [Ian Fisk] 12:35:40
for these other kinds of metrics, like time to physics.

      [Dirk] 12:35:56
Or efficient use of resources. I mean, that's the one main difference I see with cloud:

      [Dirk] 12:36:02
you buy resources,

      [Dirk] 12:36:07
      You have them sitting on your floor, you might as well use them, because it's already paid for.

      [Dirk] 12:36:10
At that point use doesn't cost much extra, okay, energy costs, whatever,

      [Dirk] 12:36:14
but you kind of have to keep them busy. With HPC

      [Dirk] 12:36:16
and cloud, you have to justify the use, because you're more elastic.

      [Dirk] 12:36:19
You get the allocation, and especially with cloud you want to make use of flexible, elastic scheduling.

      [Dirk] 12:36:28
So at that point you have to justify each use, and it's more complicated to do that.

      [Dirk] 12:36:34
But hopefully, if you do it right, you get a more efficient use of resources out of it.

      [Enrico Fermi Institute] 12:36:43
But how do you measure that?

      [Dirk] 12:36:46
I don't know.

      [Enrico Fermi Institute] 12:36:50
Because think of it this way: say 10% of what we're doing now diverts to the cloud. Then you have to see if that 10% diversion would give you more bang for the buck.

      [Ian Fisk] 12:37:21
Well, we actually did this, from a cost standpoint, in a crude way, for disaster

      [Ian Fisk] 12:37:28
recovery. The scenario is: I've messed up my reconstruction, I need to reprocess things, and I only have a month.

      [Ian Fisk] 12:37:39
Is there a model, a reasonable insurance policy, which says: I'm going to use the cloud for that kind of thing?

      [Ian Fisk] 12:37:45
So in some sense you can make arguments for where this is valuable in very specific situations, like when there's been a problem.

      [Johannes Elmsheuser] 12:38:25
I have a completely different comment, or a question, on the third point you have here, the bullet point on data

      [Johannes Elmsheuser] 12:38:32
safeguarding. Is this something of concern or not?

      [Johannes Elmsheuser] 12:38:40
Or do we just say that the team basically has to safeguard our data against users who are repeatedly downloading it,

      [Johannes Elmsheuser] 12:38:54
and then we are safe? Is there something else behind

      [Fernando Harald Barreiro Megino] 12:38:56
      what.

      [Johannes Elmsheuser] 12:38:59
this data safeguarding keyword

      [Johannes Elmsheuser] 12:39:02
here?

      [Fernando Harald Barreiro Megino] 12:39:03
Well, that's a comment that I sometimes hear: that you don't want to have the data only in the cloud.

      [Johannes Elmsheuser] 12:39:27
Okay, right. So it's the computing model question of whether you always have, so to say, another unique copy of your raw data,

      [Johannes Elmsheuser] 12:39:40
for example in the cloud. That would be behind that.

      [Fernando Harald Barreiro Megino] 12:39:43
Yeah. So, what the overall role is: can a cloud be a nucleus?

      [Fernando Harald Barreiro Megino] 12:39:50
Or can the cloud only be treated as temporary storage?

      [Fernando Harald Barreiro Megino] 12:40:00
The point is to let people express any worries regarding this.

      [Ian Fisk] 12:40:12
I guess I would like to express a worry regarding that, which is that I don't think any reasonable funding agency is going to let you make a custodial copy of the data in the cloud, because there's no guarantee that they don't change the rates to become

      [Ian Fisk] 12:40:28
prohibitively expensive to move things out, or prohibitively expensive to move things in.

      [Ian Fisk] 12:40:33
In the same way that the agency won't let you sign a 10-year lease on a fiber without tremendous amounts of negotiation,

      [Ian Fisk] 12:40:40
      They're not going to allow you to make a commitment in perpetuity for data storage.

      [Ian Fisk] 12:40:44
So I think that almost by definition puts the clouds in a very particular place in terms of storage and processing: things that are transient, things that can be exported at the end of the job. Because otherwise you're in a difficult situation.

      [Kaushik De] 12:41:16
Yeah, coming back to the question of how to make the most out of the clouds:

      [Kaushik De] 12:41:20
one of the things that we have heard a lot about over the past many years are the AI/ML tools and capabilities and ecosystem on the cloud. Is that something we should continue to pursue? Is that something that should be added to the list, in terms of: are we missing out on something?

      [Enrico Fermi Institute] 12:41:33
      Okay.

      [Kaushik De] 12:41:47
Or is that something that we think we know how to do better with our own tools?

      [Dirk] 12:41:55
There is a session in the afternoon actually on R&D,

      [Dirk] 12:41:58
specifically on machine learning training, and we actually have an invited talk.

      [Dirk] 12:42:04
I think it's HPC,

      [Dirk] 12:42:07
training on HPC, but it's similar; I mean, it applies to both HPC and cloud.

      [Enrico Fermi Institute] 12:42:21
It's also the case that the clouds have some proprietary exotic cards, right, that aren't available to the general public and are really meant for machine learning applications.

      [Dirk] 12:42:37
Yeah, but the bigger question is then: what role will machine learning play in

      [Dirk] 12:42:46
our computing operations going forward? And I don't know that we have the answer;

      [Dirk] 12:42:50
neither CMS nor ATLAS has the final answer on that.

      [Dirk] 12:42:53
So it's a bit hard to say: this is the way to go.

      [Kaushik De] 12:43:02
I mean, the one thing is... I think we have

      [Kaushik De] 12:43:11
been trailblazers in many, many areas, but when it comes to the production use of AI/ML, the everyday use of AI/ML,

      [Kaushik De] 12:43:26
I think cloud and business systems do so much of it.

      [Kaushik De] 12:43:34
How do we pull that in and access that?

      [Kaushik De] 12:43:40
And I'm not just being paranoid, but to me, for production-level activities: I notice that almost anything Google does nowadays, from their own products like Maps and so on, to the services that they provide, is really heavily

      [Kaushik De] 12:44:08
dominated by AI/ML; it's almost exclusively AI/ML. But are we?

      [Dirk] 12:44:21
Let me maybe make a comment, because

      [Dirk] 12:44:25
yesterday we showed a CMS use case where they basically ran a MiniAOD production: you take the AOD, which is a larger analysis format, slim it down, and do some recomputation

      [Dirk] 12:44:37
to get to a MiniAOD, which is smaller and actually useful for

      [Dirk] 12:44:40
analysis. And they are pushing for a model where they use machine learning algorithms;

      [Dirk] 12:44:47
the algorithm does use machine learning, but during the production phase you run only the inference server. So you're not actually running the learning.

      [Dirk] 12:44:55
And that, for me, is the bigger question:

      [Dirk] 12:44:58
because if you do a one-time shot where you run your learning algorithms on a bunch of data that we have,

      [Dirk] 12:45:04
figure out what you want to do, and then only run the inference

      [Dirk] 12:45:08
during the heavy-lifting reconstruction or whatever else you do, then I'm not sure to what extent this is really impacting the overall computing operations.
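A minimal sketch of the train-once, infer-many pattern described here, assuming PyTorch and a hypothetical TorchScript model file; the expensive learning happened offline, and the production job only evaluates the frozen model:

    import torch

    # Load a model trained once, offline; production only runs inference.
    model = torch.jit.load("tagger.pt")   # hypothetical pre-trained TorchScript file
    model.eval()

    with torch.no_grad():                 # no gradients: inference only
        features = torch.randn(1024, 20)  # stand-in for per-event input features
        scores = model(features)          # e.g. per-event classifier scores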

      [Kaushik De] 12:45:32
Yeah. And another aspect of this is that elasticity comes in when you talk about training; I mean, unless you go to continuous training models, which people are trying to do.

      [Dirk] 12:45:57
How much capacity do these large training runs

      [Dirk] 12:46:03
really take? Is that making an impact on our overall compute resource use?

      [Kaushik De] 12:46:28
Yeah, and we already run hyperparameter optimization as a service in that context.

      [Dirk] 12:46:28
      Okay, So that.

      [Ian Fisk] 12:46:43
I think that's probably one of the ideal applications, primarily for HPC,

      [Dirk] 12:46:46
      Yeah.

      [Ian Fisk] 12:46:48
because they already have that kind of hardware, and it doesn't...

      [Dirk] 12:47:03
The one thing, though, with this kind of application, and we will make a comment on it in the report:

      [Dirk] 12:47:10
by design it kind of happens outside the current production systems and infrastructure. It's standalone, so I'm not sure to what extent it's really in scope

      [Dirk] 12:47:22
for the report.

      [Ian Fisk] 12:47:22
I think this is one of the places where the concept of scheduling for peak comes into play, because as you go to more machine learning things that require training and hyperparameter tuning before you start running, you change when the computing is spent: you spend the computing beforehand, and

      [Dirk] 12:47:37
      Yes.

      [Ian Fisk] 12:47:39
then it's much faster on things like inference. So it is a place where the model that says we're going to use them all in DC, at a constant rate, doesn't fit.

      [Dirk] 12:47:56
And also, I mean, that's even where I see a mismatch:

      [Dirk] 12:48:01
thinking about pledging such resources: if you assume that this resource use is significant, you want to be able to pledge it.

      [Enrico Fermi Institute] 12:48:14
      Okay.

      [Dirk] 12:48:15
But it's a single-purpose pledge, which is completely outside the scope of what pledging currently is.

      [Dirk] 12:48:22
You want to get some kind of credit for such a use case, so that's even harder than what we discussed so far, which is basically just adjusting the pledging to be more

      [Dirk] 12:48:37
like a time-integrated value, not just the AC versus

      [Ian Fisk] 12:48:41
Right, and the kind of resources we're talking about here are the most expensive things we have.

      [Dirk] 12:48:41
DC argument.

      [Enrico Fermi Institute] 12:48:54
So maybe that needs to be written in the final report, to get across the idea of pushing for flexibility.

      [Enrico Fermi Institute] 12:49:12
Because it is a different thing: for the training you really do want to use hardware that's designed for it; it works so much better.

      [Enrico Fermi Institute] 12:49:22
Which makes it special, because it's specialized beyond what our code stack uses.

      [Dirk] 12:49:40
I mean, we're trying that too.

      [Dirk] 12:49:44
This is an active area of R&D, trying different approaches. I mean, in CMS we have the HLT;

      [Dirk] 12:49:50
the tracking basically runs on GPU, and that sees a pretty significant speedup.

      [Steven Timm] 12:50:13
A good point, not just for you guys and Lancium, but also for some of the other more exotic resources, even more probably on the HPCs and the LCF

      [Steven Timm] 12:50:23
systems: there are opportunities for things that can opportunistically go and grab a couple of hours of compute and come back with useful stuff.

      [Steven Timm] 12:50:36
You may want to think about whether there's a redesign of the workload that has to happen to best exploit those kinds of resources, because on some of them,

      [Steven Timm] 12:50:52
if you're preempted you lose everything; basically, you could be running for 10 hours and get killed with 2 to go, or something like that.

      [Steven Timm] 12:50:58
We hit, for instance, a case where you could only get a 24-hour job length if you submitted at least 1,000 jobs.

      [Steven Timm] 12:51:08
So those are things to consider. I don't have any answers for that, but it's something you should keep in mind when you're planning for non-conventional resources,

      [Steven Timm] 12:51:20
to make sure you can get more stuff done.

      [Dirk] 12:51:23
I think that's one of the differences between the approaches targeting HPC; that mostly affects HPC, because cloud just allows you to schedule whatever you're paying for.

      [Dirk] 12:51:35
So they don't...

      [Steven Timm] 12:51:38
Well, Lancium can go down any time, right?

      [Dirk] 12:51:40
They can, but in practice, if they went down every 30 minutes it would probably become unusable for us. So we rely on the fact that, even though in principle it can go down every 30 minutes, it doesn't actually happen all that often, and we cover

      [Dirk] 12:52:00
whatever happens; we make it an efficiency problem. Basically, the failure-handling code in our software

      [Dirk] 12:52:06
stack can deal with it, and it just becomes an efficiency issue that goes into the cost

      [Dirk] 12:52:10
calculation. If it gets more complicated than that, it becomes really problematic to use the resources. I know that ATLAS has the Harvester model where, in principle, you can survive;

      [Dirk] 12:52:23
you can make use of very short time windows.

      [Dirk] 12:52:28
But we don't have that in CMS, and I'm not sure how effective that is for ATLAS either.
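Folding preemption into the cost calculation, as described here, can be as simple as discounting the spot price by the fraction of work lost; a sketch with assumed numbers:

    def effective_price(spot_price, wasted_fraction):
        """Price per *useful* core-hour once preempted, discarded work is counted."""
        return spot_price / (1.0 - wasted_fraction)

    # Assumed numbers: $0.01/core-hour spot rate, 5% of work lost to preemptions.
    print(effective_price(0.01, 0.05))  # ~$0.0105 per useful core-hour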

      [Fernando Harald Barreiro Megino] 12:52:46
Dirk, Kenyi: what do you think, should we close this session?

      [Dirk] 12:52:56
Yeah, I mean, there's less than 10 minutes left.

      [Dirk] 12:52:59
There was some talk about maybe pulling one of the talks forward, but that's not enough time, and it would probably trigger discussion.

      [Enrico Fermi Institute] 12:53:00
      The

      [Dirk] 12:53:07
So we can go with it first in the next session.

      [Enrico Fermi Institute] 12:53:11
Yeah, I think the discussions we've been having the last 10 or 15 minutes lead nicely into the

      [Enrico Fermi Institute] 12:53:17
R&D presentation.

      [Enrico Fermi Institute] 12:53:25
Maybe we break here, unless anybody has any other cloud topics that they want to bring up.

      [Enrico Fermi Institute] 12:53:30
      I think this is the last session that's focused exclusively on cloud

      [Enrico Fermi Institute] 12:53:37
Yeah, in the next session we'll talk about some R&D

      [Enrico Fermi Institute] 12:53:43
things, and networking.

      [Enrico Fermi Institute] 12:53:53
Okay, so maybe we break here, and we'll see everybody at one o'clock.

       

      • 10:00
        Current Landscape and Use 20m

        ATLAS - (Google Project, AWS)
        CMS - (Lancium, Scale testing)

        Eastern Time

         

        [Kenyi Paolo Hurtado Anampa] 11:05:44
Okay, so good morning, everyone. Today we will cover

        [Kenyi Paolo Hurtado Anampa] 11:05:51
two different things. One is resources; this is going to be all of the morning session. In the afternoon we are going to talk mostly about networking, the differences between HPCs and clouds, and then R&D. For the first focus area, we

        [Kenyi Paolo Hurtado Anampa] 11:06:12
will start by just summarizing, at a very high level,

        [Kenyi Paolo Hurtado Anampa] 11:06:17
what ATLAS and CMS have done. In the case of ATLAS,

        [Kenyi Paolo Hurtado Anampa] 11:06:24
they have a self-contained, cloud-native site, and they have their own squid and CVMFS.

        [Kenyi Paolo Hurtado Anampa] 11:06:42
This will be covered in more detail in the next slides.

        [Kenyi Paolo Hurtado Anampa] 11:06:47
And then for CMS, this is basically describing what was done about 5-6 years ago

        [Kenyi Paolo Hurtado Anampa] 11:06:55
during the demo testing that CMS did with production workflows. The way this was done was by extending an existing CMS site, more particularly the Fermilab resources, with resources in the cloud, and this was done via

        [Kenyi Paolo Hurtado Anampa] 11:07:14
HEPCloud. Again, this will be described in more detail in the next few slides. Since it was done this way, in terms of production integration we have the same reservations as HPCs in terms of storage for workflows, which means that all data must be staged

        [Kenyi Paolo Hurtado Anampa] 11:07:36
        to existing sites.

        [Kenyi Paolo Hurtado Anampa] 11:07:42
Go on with the next slide; this is yours, Fernando.

        [Fernando Harald Barreiro Megino] 11:07:48
Yeah, sure. So this is the overview of what we are working on in ATLAS.

        [Fernando Harald Barreiro Megino] 11:07:54
We have 2 main projects. The one on the left is on Amazon, and this comes through California State University, Fresno.

        [Fernando Harald Barreiro Megino] 11:08:04
Here we have basically a PanDA queue, a storage element, and also a squid; those are always the 3 main cost components that we will see later, together with

        [Fernando Harald Barreiro Megino] 11:08:13
the egress. The second part is the project that we have in

        [Fernando Harald Barreiro Megino] 11:08:23
Google. It used to be US ATLAS-centric, but this year, since the middle of July, it became a worldwide ATLAS project,

        [Fernando Harald Barreiro Megino] 11:08:35
and so ATLAS, as a collaboration, is participating in the budget.

        [Fernando Harald Barreiro Megino] 11:08:42
In this project we have a similar setup as in Amazon, with a PanDA queue, a Rucio storage element, and the squid.

        [Fernando Harald Barreiro Megino] 11:08:50
But we also work on an analysis facility prototype, with Jupyter and Dask.

        [Fernando Harald Barreiro Megino] 11:08:59
The integration of these cloud resources was done by the Rucio team and the PanDA team.

        [Fernando Harald Barreiro Megino] 11:09:07
So we take a different approach than if you were trying to extend an existing site: we just generate a self-contained, cloud-native site. In the case of Rucio and the storage, it works in the way that you download the key from Amazon or from

        [Fernando Harald Barreiro Megino] 11:09:28
Google, and with that key you can sign URLs. With a signed URL you can say: you can upload this particular file until an hour from now, or you can download it or delete it. This key then needs to be put into Rucio and into FTS so that they can generate

        [Fernando Harald Barreiro Megino] 11:09:47
the signed URLs for the downloads or the third-party transfers. For the compute part, it's all based on Kubernetes and native integration.
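
(A minimal sketch of the signed-URL mechanism described above, assuming the google-cloud-storage Python client and a downloaded service-account key; the bucket, object, and key-file names are placeholders, and in production Rucio and FTS generate these URLs internally rather than via this snippet.)

```python
# Sketch: issue a short-lived V4 signed URL for one object in a bucket.
from datetime import timedelta

from google.cloud import storage

client = storage.Client.from_service_account_json("sa-key.json")  # downloaded key
blob = client.bucket("example-rse-bucket").blob("data/file.root")

# Whoever holds this URL may upload this object for the next hour;
# method="GET" or method="DELETE" would instead sign a download or a deletion.
url = blob.generate_signed_url(version="v4",
                               expiration=timedelta(hours=1),
                               method="PUT")
print(url)
```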

        [Fernando Harald Barreiro Megino] 11:10:00
In particular, there is nothing like a Condor in the setup. And then we have CVMFS installed on the nodes of our Kubernetes cluster; that was one of the things that actually took most of the effort to get to a very reliable

        [Fernando Harald Barreiro Megino] 11:10:18
state. And then also the Squid part: you can either run it as part of the Kubernetes cluster, or, in Google for example, I just run a load-balanced instance group. And the other thing that we always use for the compute is the auto-

        [Fernando Harald Barreiro Megino] 11:10:41
scaling. So when there are no jobs queued, for example, the PanDA compute part

        [Fernando Harald Barreiro Megino] 11:10:48
shrinks to a minimum, and then, if you submit a lot of jobs, the cluster grows up to the limit, or as much as needed for hosting all of the jobs. The setup is not bound to any particular cloud provider;

        [Fernando Harald Barreiro Megino] 11:11:07
it's just standard protocols and technology.

        [Fernando Harald Barreiro Megino] 11:11:10
So you can in principle use the same setup on other cloud

        [Fernando Harald Barreiro Megino] 11:11:13
providers. For example, I tried out the PanDA part one time in Oracle Cloud, just to see that it works.
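
(A rough sketch of the scale-to-demand behaviour described above: pilots are created as Kubernetes Jobs, and the pending pods are what drive the cluster autoscaler up and down. The real ATLAS setup drives this through Harvester; the queue-depth function and container image below are hypothetical placeholders.)

```python
# Sketch: one Kubernetes Job per queued payload; the cluster autoscaler
# adds nodes while pods are pending and removes them when Jobs finish.
from kubernetes import client, config


def queued_payloads() -> int:
    """Hypothetical stand-in for querying the PanDA queue depth."""
    return 42


config.load_kube_config()
batch = client.BatchV1Api()

for _ in range(queued_payloads()):
    job = client.V1Job(
        metadata=client.V1ObjectMeta(generate_name="pilot-"),
        spec=client.V1JobSpec(
            template=client.V1PodTemplateSpec(
                spec=client.V1PodSpec(
                    restart_policy="Never",
                    containers=[client.V1Container(
                        name="pilot",
                        image="example/pilot:latest",  # placeholder image
                        resources=client.V1ResourceRequirements(
                            requests={"cpu": "1", "memory": "2Gi"}),
                    )],
                ))),
    )
    batch.create_namespaced_job(namespace="default", body=job)
```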

        [Fernando Harald Barreiro Megino] 11:11:22
Yeah, then the next slide, please.

        [Fernando Harald Barreiro Megino] 11:11:27
So one of the things that you can exploit on all of these commercial clouds is all the different types of architectures that they have, and that you don't always have on grid sites.

        [Fernando Harald Barreiro Megino] 11:11:41
        One particular example is on Amazon. We were doing some arm testing.

        [Fernando Harald Barreiro Megino] 11:11:47
So in this case it was Johannes and his team that were trying to build the Athena simulation software for arm64.

        [Fernando Harald Barreiro Megino] 11:11:58
They had done the build, and they wanted to do a small physics

        [Fernando Harald Barreiro Megino] 11:12:01
validation, running a whole task with that.

        [Fernando Harald Barreiro Megino] 11:12:04
But there was not really any volunteer, any available grid site with ARM resources, that could set that up.

        [Fernando Harald Barreiro Megino] 11:12:12
So what we did is we set it up in Amazon with the Graviton 2 nodes, as in the diagrams on the right side.

        [Fernando Harald Barreiro Megino] 11:12:25
The first validation that Johannes did was with 10,000 events,

        [Fernando Harald Barreiro Megino] 11:12:29
and he compared the x86 results that had been executed on the grid,

        [Fernando Harald Barreiro Megino] 11:12:32
I believe, against the arm64 on Amazon, and

        [Fernando Harald Barreiro Megino] 11:12:36
it was matching quite well. And then, some weeks later, we prepared the full physics validation with 1 million events, and that was fully signed off a few weeks ago.

        [Fernando Harald Barreiro Megino] 11:12:49
So in principle, simulation could be executed

        [Fernando Harald Barreiro Megino] 11:12:57
like in standard production now. And I mean, we don't do this in particular for the cloud; we do it more, as was discussed yesterday in the HPC

        [Fernando Harald Barreiro Megino] 11:13:06
session, because most of the next-generation HPCs are going to come with more ARM CPUs, and x86 is not going to be as dominant,

        [Fernando Harald Barreiro Megino] 11:13:17
so it's a preparation for that.

        [Fernando Harald Barreiro Megino] 11:13:20
Other things: so, other exotic architectures or resources that can be used

        [Fernando Harald Barreiro Megino] 11:13:27
in the cloud. For example, there is a user that is doing some trigger studies for a filter, and there he's using FPGAs on Amazon. Or Johannes, for building the software, uses very large nodes on Amazon and

        [Fernando Harald Barreiro Megino] 11:13:47
Google, and also GPU stuff. Next slide, please. And if anyone has a question or comment while I'm going through the slides, you can interrupt me.

        [Fernando Harald Barreiro Megino] 11:14:01
Now we come to Google, just running Google as a grid site. You can see two different approaches. On the top right plot

        [Fernando Harald Barreiro Megino] 11:14:14
you can see how we were doing scale tests.

        [Fernando Harald Barreiro Megino] 11:14:17
That was done during the previous funding round, where we were trying to see how far we can scale it in a single cloud region, and we were getting to 100,000 cores in europe-west1,

        [Fernando Harald Barreiro Megino] 11:14:35
which is one of the European regions. And if you would want to scale this out even more, you could replicate the setup to multiple regions, in Europe, in the US, and so on, reaching a very high number of cores. What we are doing now,

        [Fernando Harald Barreiro Megino] 11:14:56
since it's a fully worldwide ATLAS project, is running a fixed-size grid site at the moment.

        [Fernando Harald Barreiro Megino] 11:15:05
We started with 5,000 cores, and we moved it to 10,000 cores

        [Fernando Harald Barreiro Megino] 11:15:10
exactly a month ago, and we can run any type of production.

        [Fernando Harald Barreiro Megino] 11:15:16
We are not running analysis at the moment, because we need to reorganize the storage.

        [Fernando Harald Barreiro Megino] 11:15:23
In particular, we need a DATADISK and a separate SCRATCHDISK, so that user outputs don't end up in the same storage element.

        [Fernando Harald Barreiro Megino] 11:15:32
But otherwise, this grid site has worked very well.

        [Fernando Harald Barreiro Megino] 11:15:38
It's very reliable, with a very low error rate.

        [Fernando Harald Barreiro Megino] 11:15:41
And the errors are usually very focused on particular situations: for example, I migrated to machines with low disk, or at one time there were issues with some tasks and I had to fix that. And our goal is to do a mix of

        [Fernando Harald Barreiro Megino] 11:16:02
both versions: mix the on-demand fast scale-out with a fixed size.

        [Fernando Harald Barreiro Megino] 11:16:12
So we plan to run more or less a flat queue with 5,000 cores, and then on top run a dynamic queue which processes urgent requests.

        [Fernando Harald Barreiro Megino] 11:16:24
Or we are going to do something that we call the full chain, where all of the steps in our production chain run inside the same resource, and you only export the final output, in order to reduce the egress cost.

        [Fernando Harald Barreiro Megino] 11:16:48
Yeah, and the next slide. Thanks, Kenyi. The other thing that we tried out is this analysis facility prototype.

        [Fernando Harald Barreiro Megino] 11:16:56
What we wanted to do is Dask scaling evaluations.

        [Fernando Harald Barreiro Megino] 11:17:03
So we installed Jupyter and Dask on Google.

        [Fernando Harald Barreiro Megino] 11:17:08
We integrated it with the ATLAS IAM, so anyone from ATLAS can connect without needing to request a new account or anything.

        [Fernando Harald Barreiro Megino] 11:17:20
        And then we have a couple of different options that the user can select for tasks.

        [Fernando Harald Barreiro Megino] 11:17:27
There is a first lightweight version, but then we also have machine learning images,

        [Fernando Harald Barreiro Megino] 11:17:31
so that people can use TensorFlow and all those libraries. And you can also, if you want, get a notebook with a GPU, and that will take a little moment, because you need to provision the machine,

        [Fernando Harald Barreiro Megino] 11:17:50
you need to install and mount CVMFS, and then add it to the cluster.

        [Fernando Harald Barreiro Megino] 11:17:57
That takes a couple of minutes, but then you have a notebook with a GPU just for yourself, and you can work as long as you need. And for the Dask part, which is, in my opinion, a very good example for cloud scalability: the lower right plot was

        [Fernando Harald Barreiro Megino] 11:18:20
one of our users, who was trying out running the same task but with a different number of workers.

        [Fernando Harald Barreiro Megino] 11:18:27
So he ran first with 100 workers, and it took 40 min.

        [Fernando Harald Barreiro Megino] 11:18:30
Then he reran the same task with 200 workers, and the duration was halved, and so on, until the last point, where he uses 1,500 workers and the task is done within just a few minutes. And the thing about this is that the cost on the cloud is

        [Fernando Harald Barreiro Megino] 11:18:51
roughly the same, except for maybe the scaling or scheduling overhead;

        [Fernando Harald Barreiro Megino] 11:18:57
the cost is roughly the same whether you run with very few workers or with a lot of workers. And for the user himself it makes a lot of difference whether he gets the results in 1 hour or in 5 minutes. And, yeah, we also should consider in the cost

        [Fernando Harald Barreiro Megino] 11:19:19
calculation the salary of the user himself, since he's optimizing his time a lot.

        [Fernando Harald Barreiro Megino] 11:19:27
Yes, and that's it. Kenyi, the next slide.
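
(A minimal sketch of the worker-scaling experiment in the lower-right plot, assuming the dask-kubernetes package; the worker pod-spec file and the worker counts are illustrative, not the facility's actual configuration.)

```python
# Sketch: same task, different worker counts -> roughly constant total cost.
from dask.distributed import Client
from dask_kubernetes import KubeCluster

cluster = KubeCluster.from_yaml("worker-spec.yml")  # pod template for workers
client = Client(cluster)

cluster.scale(100)     # the example task took ~40 min at 100 workers
# cluster.scale(1500)  # the same task finishes in a few minutes

# Or let the scheduler grow and shrink the worker pool with demand:
cluster.adapt(minimum=0, maximum=1500)
```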

        [Kenyi Paolo Hurtado Anampa] 11:19:32
Thanks, Fernando. Yes, and so then for CMS again:

        [Kenyi Paolo Hurtado Anampa] 11:19:37
this is what was done a few years ago, and again, as I mentioned before, we did this by integrating cloud resources into one of the existing sites, the Fermilab site,

        [Kenyi Paolo Hurtado Anampa] 11:19:51
via HEPCloud. So you basically have a workflow injected,

        [Kenyi Paolo Hurtado Anampa] 11:19:55
        which is the resource provisioning trigger.

        [Kenyi Paolo Hurtado Anampa] 11:19:58
        This enters the facility interface which talks to the authentication and authorization mechanisms.

        [Kenyi Paolo Hurtado Anampa] 11:20:04
Then there is a decision engine and a facility pool

        [Kenyi Paolo Hurtado Anampa] 11:20:09
there, and the decision engine basically talks to a provisioner that will be talking to the cloud.

        [Kenyi Paolo Hurtado Anampa] 11:20:16
And so this is basically a diagram of the HEPCloud architecture.

        [Kenyi Paolo Hurtado Anampa] 11:20:22
From there you basically get glideins started on the resources in the cloud,

        [Kenyi Paolo Hurtado Anampa] 11:20:34
so you have them connecting to the HTCondor schedulers in the glideinWMS infrastructure, and that's how everything

        [Kenyi Paolo Hurtado Anampa] 11:20:43
is connected in this case.

        [Kenyi Paolo Hurtado Anampa] 11:20:53
Okay. And the next part is Lancium. Dirk, I think you're talking?

        [Dirk] 11:21:01
Yes. So, Lancium was already mentioned yesterday. It's an interesting new

        [Dirk] 11:21:11
company. They're not like your traditional full-service cloud provider that basically operates worldwide and gives you anything you want in terms of capabilities and instance types and whatever; they're really geared towards utilizing low-cost renewable energy

        [Dirk] 11:21:35
to provide cheap compute, basically. Part of the business model is almost like an energy utility:

        [Dirk] 11:21:42
they basically get money for being able to load-shed. And

        [Dirk] 11:21:47
they're constructing their data centers right now in areas with very high renewable wind energy. We did a test a few months back where we integrated them into production;

        [Dirk] 11:21:59
we ran a few small workflows. It was all on free cycles, as a test.

        [Dirk] 11:22:05
They are a bit different from AWS or Google: they only support Singularity containers, not VMs.

        [Dirk] 11:22:11
And what we did is we just ran a pilot job in the Singularity container, and the pilot itself is just the standard CMS pilot,

        [Dirk] 11:22:21
so it runs our payloads in a nested Singularity container.

        [Dirk] 11:22:26
CVMFS and a local Squid were provided by Lancium;

        [Dirk] 11:22:30
we worked with them on that. They currently don't have any locally managed storage,

        [Dirk] 11:22:34
just job scratch. And we basically run these resources like we do opportunistic OSG

        [Dirk] 11:22:39
or HPC sites, where we don't use managed storage:

        [Dirk] 11:22:42
we just use AAA reads to get the input and then stage out to Fermilab.

        [Dirk] 11:22:48
So that covers the runtime.

        [Dirk] 11:22:50
The provisioning integration is another potentially problematic area for the long term, because they have a custom API, which is not compatible with AWS or Google. And they're running Singularity containers, so you need some way to start up

        [Dirk] 11:23:05
a container. What we're doing right now is just vacuum provisioning:

        [Dirk] 11:23:10
when we want to run a test, we just start up containers manually as needed.

        [Dirk] 11:23:15
And that's relatively simple through the API, because the API is just...

        [Dirk] 11:23:19
you can run some script that calls out to the Python API,

        [Dirk] 11:23:25
calls out to the API and asks how many containers are running;

        [Dirk] 11:23:28
if it's less than 10, you bring it up to 10. So that's basically the level of integration that we have right now.
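
(A sketch of that vacuum-provisioning loop. Lancium's API is custom and not shown in the talk, so the base URL, endpoints, and job fields below are hypothetical placeholders for whatever their Python client actually exposes; only the if-fewer-than-ten-start-more logic comes from the discussion.)

```python
# Sketch: top the pool of pilot containers back up to a fixed target.
import requests

API = "https://api.example-lancium.test/v1"      # hypothetical endpoint
HEADERS = {"Authorization": "Bearer <account-token>"}
TARGET = 10                                      # desired pilot containers

jobs = requests.get(f"{API}/jobs", headers=HEADERS).json()
running = [j for j in jobs if j.get("status") == "running"]

# Each started container runs the standard pilot, which in turn runs
# payloads in a nested Singularity container.
for _ in range(TARGET - len(running)):
    requests.post(f"{API}/jobs", headers=HEADERS,
                  json={"image": "example/pilot.sif", "cores": 8})
```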

        [Dirk] 11:23:36
So we think it's interesting enough. It will require a little bit of work to get it fully integrated,

        [Dirk] 11:23:44
but we're working with Lancium on procuring a small number of cycles for more tests.

        [Dirk] 11:23:51
The plan is maybe to get some cycles there and then see if we can,

        [Dirk] 11:23:57
when there's particular load from CMS specifically on Fermilab, say: okay, we bring up Lancium resources, and that frees up resources at Fermilab to do stuff that is most suited to a Tier-1.

        [Enrico Fermi Institute] 11:24:28
Just to jump in there for a second: in the case of all of this, it's very much oriented around production jobs. Do you think we could organize, sometime in the next year or something like that, trying to interface this with either Coffea-Casa or the Elastic Analysis Facility

        [Enrico Fermi Institute] 11:24:45
effort, to see if we can gain more flexibility for more bursty analysis jobs, much like what ATLAS was doing with Google Cloud and whatnot?

        [Dirk] 11:24:56
We could try. I mean, the...

        [Enrico Fermi Institute] 11:24:58
The security is gonna be a nightmare at the Elastic Analysis

        [Dirk] 11:25:03
Yeah, the thing is, it really depends how well everything plays together with the provisioning integration.

        [Enrico Fermi Institute] 11:25:10
Facility, for sure.

        [Dirk] 11:25:10
I mean, they have a simple API. It's just...

        [Enrico Fermi Institute] 11:25:10
        Yeah.

        [Dirk] 11:25:13
You basically need a token associated with your account, and then you have a single, monolithic Python script that they give you, where you can tell it to start a container and bring up something. So it's

        [Enrico Fermi Institute] 11:25:25
        Okay.

        [Dirk] 11:25:27
relatively simple. So sure, I mean, we can look at it. It's a matter of: do you want to do it?

        [Enrico Fermi Institute] 11:25:32
        I mean

        [Dirk] 11:25:35
Do you want to do tests, or do you want to do it for real?

        [Dirk] 11:25:37
Because if you do it for real, then you actually need to have paid for a number of cycles sitting there that you can use; for tests, we can just go whenever.

        [Enrico Fermi Institute] 11:25:46
Yeah, I think we would need to get the facilities up, or at least the one at Fermilab as it is right now, and then go for a more for-real test with people's actual analysis

        [Enrico Fermi Institute] 11:25:57
jobs. Once we have that set up, I think that would be the better way to see how this actually works.

        [Enrico Fermi Institute] 11:26:06
So this is like a year timescale, or something like that.

        [Enrico Fermi Institute] 11:26:09
Your analysis facility: do you have any implicit dependencies on shared file systems, or anything like that? Because we're at Fermilab, we're restricted from using shared file systems, aside from, like, XRootD and stuff. Okay, yeah, I was gonna

        [Enrico Fermi Institute] 11:26:24
say, that might be one challenge in stretching out to there, right: how do you get a shared file system out there?

        [Dirk] 11:26:29
        They.

        [Enrico Fermi Institute] 11:26:29
Exactly. But thankfully we've already been forced to solve that.

        [Dirk] 11:26:33
Maybe, since Lindsey just mentioned a year: the time horizon on that. Currently Lancium,

        [Dirk] 11:26:41
as I said, is a young company that's starting up. They're kind of still building the data centers. So they have a test data center

        [Enrico Fermi Institute] 11:26:45
        Hmm.

        [Dirk] 11:26:51
that's in Houston, which is not really using renewable energy,

        [Dirk] 11:26:54
but where they're basically just deploying the whole hardware-software integration that they're working with, and that's what we've been testing. What they're building right now, and which is supposed to come online at some point later this year or early next year, are really the big data centers,

        [Dirk] 11:27:08
which are co-located with wind-energy hotspots in Texas.

        [Dirk] 11:27:15
There's not much else there, but they're building a data center, and those will be the interesting

        [Dirk] 11:27:20
ones, basically, because that's real renewable energy.

        [Dirk] 11:27:22
There's lots and lots of power capacity there. And, more importantly, they're going to connect them at 100 gigabit to ES-

        [Dirk] 11:27:33
net and everything else.

        [Enrico Fermi Institute] 11:27:34
Okay, they're actually going to peer? They're actually going to connect, peer with ESnet, for sure?

        [Dirk] 11:27:41
That's what their plan is, because they are kind of pushing; they're basically making the sales pitch hard to academic users.

        [Dirk] 11:27:51
I mean, I've seen talks from them at SC and OSG meetings.

        [Dirk] 11:27:56
They basically travel around Europe, because for Europe running compute on cheap power is an even bigger concern right now than in the US,

        [Dirk] 11:28:06
because power prices there traditionally have been much higher,

        [Dirk] 11:28:09
and now are extremely much higher than in the US.

        [Enrico Fermi Institute] 11:28:12
But they're gonna connect to ESnet versus Internet2?

        [Dirk] 11:28:17
Probably. I mean, they said they're really...

        [Dirk] 11:28:20
They basically point out that that's another of their selling points:

        [Dirk] 11:28:24
they point out that they want to not charge for egress.

        [Dirk] 11:28:32
So, like, not charging for egress and good network

        [Dirk] 11:28:39
integration for academic workloads seems to be what they're focusing on. I mean, you have to look at it: they are low quality of service,

        [Dirk] 11:28:50
somewhat by design. They're not like Amazon, ready to sell you a VM

        [Dirk] 11:28:55
and promise you 99-point-whatever availability. Lancium tells you:

        [Dirk] 11:28:59
if there's no wind, we're gonna load-shed like crazy,

        [Dirk] 11:29:04
so we're gonna evacuate you, and that's fine.

        [Dirk] 11:29:08
But that also means that they have to have other selling points, and other target markets, because they're not gonna attract, like, the financial sector or industry

        [Dirk] 11:29:18
that wants a high-uptime compute service sitting somewhere.

        [Ian Fisk] 11:29:22
But the key thing was, it's ESnet, not Internet2, right?

        [Dirk] 11:29:26
I'm not exactly sure. They basically... we had...

        [Ian Fisk] 11:29:28
        Oh sure!

        [Dirk] 11:29:30
And you have to remember, the discussions we had with them were

        [Ian Fisk] 11:29:32
        Right.

        [Dirk] 11:29:34
months before... these data centers are still under construction, so I don't think the network is connected yet.

        [Ian Fisk] 11:29:38
Right. The only reason I ask is that the ESnet charter is relatively strict: it will allow you to connect Fermilab or BNL or CERN to Lancium resources, but, for instance, it won't carry traffic from a university. So one of the

        [Dirk] 11:29:39
        Nope.

        [Ian Fisk] 11:29:58
endpoints needs to be under the ESnet charter, which can be a little limiting.

        [Dirk] 11:30:04
I mean, at the moment I would read it as a statement of intent that they want to do everything they can on the network-integration side to make it easy for us to use their facilities. What's actually deployed and how things are connected, I think we have to wait for these data centers.

        [Ian Fisk] 11:30:08
        Okay.

        [Ian Fisk] 11:30:15
        Right.

        [Ian Fisk] 11:30:22
Right, and it's fine, as long as that matches under the charter.

      • 10:20
        Cost 20m

        Effort
        Support infrastructure
        Outlook

HPC Cost

(Eastern Time)

         

        [Kenyi Paolo Hurtado Anampa] 11:30:50
Yes. So the next section in these slides is about cloud costs, and just like yesterday, the green

        [Kenyi Paolo Hurtado Anampa] 11:31:04
here means this is one of the questions from the charge that we need to answer.

        [Kenyi Paolo Hurtado Anampa] 11:31:08
And it's basically: what is the total cost of operating commercial cloud resources for collaboration workflows?

        [Kenyi Paolo Hurtado Anampa] 11:31:15
This is mostly focused on production workflows for the charge, both the compute-resource costs as well as the operational effort, for LHC

        [Kenyi Paolo Hurtado Anampa] 11:31:24
Run 3. And so with that, we will start with the experience so far on costs for ATLAS and CMS. Fernando, do you want to take over from here?

        [Fernando Harald Barreiro Megino] 11:31:38
Yes. So, I mean, the content of this slide is mostly my opinion and experience.

        [Fernando Harald Barreiro Megino] 11:31:47
And just to note that with this ATLAS cloud project there will be a dedicated TCO,

        [Fernando Harald Barreiro Megino] 11:31:59
total cost of ownership, board that will then study the costs in detail.

        [Fernando Harald Barreiro Megino] 11:32:08
So, to explain a little bit the cost model that you have in the cloud:

        [Fernando Harald Barreiro Megino] 11:32:13
for the compute there are different tiers of virtual machine.

        [Fernando Harald Barreiro Megino] 11:32:20
So you have the reserved instances, where you say, basically, I will

        [Fernando Harald Barreiro Megino] 11:32:24
reserve so many CPUs for a year; but that also means that you are stuck for the full year with those reserved instances, and there is no real elasticity.

        [Fernando Harald Barreiro Megino] 11:32:38
Then there is on-demand, which is the standard price for on-demand virtual machines, where you request a virtual machine when you want it, and then, once you have it, nobody takes it away from you, for as long as you like. And then the tier below on-demand is spot,

        [Fernando Harald Barreiro Megino] 11:32:58
where you request a virtual machine and may get it, but

        [Fernando Harald Barreiro Megino] 11:33:06
it can also be taken away from you whenever Google needs it

        [Fernando Harald Barreiro Megino] 11:33:09
for someone else, or it can just be rescheduled and put somewhere else if they need to do some optimization within their computing center.

        [Fernando Harald Barreiro Megino] 11:33:24
Yeah, it's, I think, 30 seconds' notice, so it's not

        [Fernando Harald Barreiro Megino] 11:33:31
nothing, but it's something we can in practice work with.

        [Fernando Harald Barreiro Megino] 11:33:34
So if you get the kill signal, you lose whatever was running on the VM

        [Fernando Harald Barreiro Megino] 11:33:40
until then.

        [Fernando Harald Barreiro Megino] 11:33:44
But the experience with spot is quite good, in my opinion. Google in particular used to have only the preemptible VMs, which had a maximum lifetime of 24

        [Fernando Harald Barreiro Megino] 11:33:59
hours; now they've stopped that model and moved to spot, and there you can have the virtual machines for a long time,

        [Fernando Harald Barreiro Megino] 11:34:07
and I don't see a significant amount of failed or wasted wall-clock time because of spot preemptions.

        [Fernando Harald Barreiro Megino] 11:34:16
Well, you will see it later, but it's like 60% cheaper than on-demand.

        [Fernando Harald Barreiro Megino] 11:34:22
Then for the storage you also have different categories:

        [Fernando Harald Barreiro Megino] 11:34:27
there is Standard, Nearline, Coldline. The time to access your files

        [Fernando Harald Barreiro Megino] 11:34:33
is the same with all of them, but

        [Fernando Harald Barreiro Megino] 11:34:38
the further you go to the right, the longer you need to keep the data on the storage. So Nearline, I think, you need to keep 30 days; Coldline,

        [Fernando Harald Barreiro Megino] 11:34:51
I don't know how many days.

        [Fernando Harald Barreiro Megino] 11:34:52
Also, the colder the class, the more you pay per access.

        [Fernando Harald Barreiro Megino] 11:34:58
So in practice, for the storage we are always using the Standard class.

        [Fernando Harald Barreiro Megino] 11:35:05
So that's the traditional cost model. Now, there is a new cost model,

        [Fernando Harald Barreiro Megino] 11:35:09
which is what we are using, and it's this US public-sector subscription agreement with Google. Basically, if you are a university or a lab from the public sector, you can negotiate a fixed price for your computing needs. So, an agreement

        [Fernando Harald Barreiro Megino] 11:35:31
for 10,000 total virtual CPUs and 7 petabytes of storage, and it's $600,000 for 15 months;

        [Fernando Harald Barreiro Megino] 11:35:48
you pay that amount, and you don't need to worry

        [Fernando Harald Barreiro Megino] 11:35:51
if you then have more egress or less. You try to optimize your agreement, obviously,

        [Fernando Harald Barreiro Megino] 11:35:58
and this protects you from surprises. At the end of your 15 months,

        [Fernando Harald Barreiro Megino] 11:36:04
I guess there will be a review, like whether your egress is out of control or not, or maybe they are happy with the situation, and it gets renegotiated. I don't want to talk about the exact amount of dollars that we have in our agreement,

        [Fernando Harald Barreiro Megino] 11:36:24
but I just want to say that it's very favorable, and it's lower than the list prices. The other thing to consider in these clouds is that the resources are very elastic.

        [Fernando Harald Barreiro Megino] 11:36:41
It's a bit what we tried to show with the Dask example: the cost

        [Fernando Harald Barreiro Megino] 11:36:45
for 10,000 CPUs for 1 hour is the same as the cost for one CPU

        [Fernando Harald Barreiro Megino] 11:36:50
for 10,000 hours, so you can run very elastically without a major cost increase. And also, since it's elastic, if you ramp down, you don't need to keep any idle resources; you really pay just for what you use. And from an operational perspective,

        [Fernando Harald Barreiro Megino] 11:37:17
in my opinion, it's very low cost. I mean, the whole development, setup, and operation of the Rucio part was done by one of the Rucio experts, and likewise all of the development, setup, and operation of the PanDA part by one PanDA expert, at fractions of

        [Fernando Harald Barreiro Megino] 11:37:38
their time. And I also think that this model is really pure DevOps, the most pure form of it:

        [Fernando Harald Barreiro Megino] 11:37:49
you operate a site, you see and learn things that are not good for the site, that you can improve in PanDA or in Harvester,

        [Fernando Harald Barreiro Megino] 11:37:59
and then you go and change those things. And also, with the same amount of FTE

        [Fernando Harald Barreiro Megino] 11:38:08
resources, whether I run a 10,000-core cluster or a 30,000-core cluster doesn't really make a difference.

        [Fernando Harald Barreiro Megino] 11:38:18
I'm now moving to the plot on the right. What I'm showing here is: all of the bins except the last one to the right are simulations using the cost calculator;

        [Fernando Harald Barreiro Megino] 11:38:34
the first ones are on Amazon, the second ones are on Google,

        [Fernando Harald Barreiro Megino] 11:38:38
and I did an average of the US ATLAS Tier-2 pledges, so that on average it's 10,000 virtual CPUs,

        [Fernando Harald Barreiro Megino] 11:38:48
of course, 7 petabytes of storage,

        [Fernando Harald Barreiro Megino] 11:38:55
and then also, on average,

        [Fernando Harald Barreiro Megino] 11:39:04
there were 1.5 petabytes of egress per month;

        [Fernando Harald Barreiro Megino] 11:39:06
I looked it up in the DDM dashboard. And then I went to the Google price calculator, and using different types of VMs

        [Fernando Harald Barreiro Megino] 11:39:17
I calculated the cost. The blue part is the CPU,

        [Fernando Harald Barreiro Megino] 11:39:21
the red part is the storage, the 7 petabytes, and the yellow part is the 1.5 PB of egress per month. And then, depending on the type of compute you use, you can reduce it. So the first one is on-

        [Fernando Harald Barreiro Megino] 11:39:38
demand; the second one is if you pay one year upfront on Amazon; then if you reserve for a year on Amazon

        [Fernando Harald Barreiro Megino] 11:39:46
but don't pay upfront, it's a little bit more expensive;

        [Fernando Harald Barreiro Megino] 11:39:49
and then you reserve for 3 years, and you see that the price starts dropping considerably. And the last one for Amazon is Amazon spot, and you see that the CPU part is really much lower than the first bin, which is Amazon on-demand.

        [Fernando Harald Barreiro Megino] 11:40:08
Then, if we move to the Google part: Google is a little bit cheaper for the CPU, at least for the calculations that I did; the egress and the storage are more or less the same as Amazon. And then the very last bin I took from

        [Fernando Harald Barreiro Megino] 11:40:30
the billing report of the Google Cloud console for the last 30 days, and I extracted how much we have been spending on each one of the things, to compare it with what I had done in my theoretical calculations. The CPU was a little bit cheaper, so...

        [Fernando Harald Barreiro Megino] 11:40:50
we use spot, so you have to compare it with the GCP

        [Fernando Harald Barreiro Megino] 11:40:53
spot bin; it's a little bit cheaper. Also, I didn't use the full 10,000 CPUs, but only around 9,200. The storage is much cheaper than the others, but also we don't have the 7 petabytes of data

        [Fernando Harald Barreiro Megino] 11:41:12
yet; we have only 1.6 petabytes, so that explains it.

        [Fernando Harald Barreiro Megino] 11:41:16
And then egress: we did 1.2 petabytes of egress according to the GCP billing, which is very close to what I had gotten in my models.

        [Fernando Harald Barreiro Megino] 11:41:28
So that's what you would be paying if you paid list prices;

        [Fernando Harald Barreiro Megino] 11:41:33
but again, with our subscription agreement, what we are effectively paying is lower than that. And this is it for this slide.

        [Fernando Harald Barreiro Megino] 11:41:46
I see there are hands. Yeah.
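
(A back-of-the-envelope sketch of the list-price estimate behind those bins: monthly cost = vCPU-hours + object storage + egress. The unit prices are illustrative placeholders, not actual AWS or GCP list prices; only the 10,000 vCPUs, 7 PB of storage, 1.5 PB/month of egress, and the "about 60% cheaper" spot figure come from the talk.)

```python
# Sketch: rough monthly list-price estimate for the quoted resource mix.
VCPU_HOUR = 0.03         # $/vCPU-hour on demand -- placeholder rate
SPOT_DISCOUNT = 0.60     # "about 60% cheaper than on-demand"
STORAGE_GB_MONTH = 0.02  # $/GB-month, Standard class -- placeholder rate
EGRESS_GB = 0.08         # $/GB egress -- placeholder rate

HOURS_PER_MONTH = 730
cpu = 10_000 * HOURS_PER_MONTH * VCPU_HOUR
cpu_spot = cpu * (1 - SPOT_DISCOUNT)
storage = 7_000_000 * STORAGE_GB_MONTH   # 7 PB expressed in GB
egress = 1_500_000 * EGRESS_GB           # 1.5 PB/month expressed in GB

print(f"CPU on-demand ${cpu:,.0f}/mo, spot ${cpu_spot:,.0f}/mo")
print(f"storage ${storage:,.0f}/mo, egress ${egress:,.0f}/mo")
```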

        [Paolo Calafiura (he)] 11:41:52
Quick, quick question; actually, it's a comment and then a question.

        [Paolo Calafiura (he)] 11:41:57
The subscription price is, of course, very advantageous, but it does

        [Paolo Calafiura (he)] 11:42:03
kind of remove the elasticity you mentioned, because, you know, if you use one CPU for 10,000 hours, you are not using your subscription very well at all. So that was the only comment that I wanted to make, and then the follow-up question is: is anyone in touch with

        [Fernando Harald Barreiro Megino] 11:42:16
        Hmm.

        [Paolo Calafiura (he)] 11:42:22
        Amazon about a model similar to this Google subscription?

        [Fernando Harald Barreiro Megino] 11:42:28
So, about elasticity: not completely, because the agreement is for 10,000 virtual CPUs on average, so you could be using 5,000 one month and 15,000 the next month. But yeah,

        [Fernando Harald Barreiro Megino] 11:42:46
if you arrive at the last day and want to use your average of 10,000 virtual CPUs times 15 months all on the last day, that will be very difficult.

        [Fernando Harald Barreiro Megino] 11:42:57
But if you consume your resources smoothly, there is some elasticity. And about the Amazon question, I don't...

        [Paolo Calafiura (he)] 11:42:57
        yeah, I did.

        [Kaushik De] 11:43:19
Yeah, we have not had that conversation with Amazon.

        [Kaushik De] 11:43:24
        We only use credits in the old traditional way.

        [Kaushik De] 11:43:33
So in some sense it's good, because we have the side-by-side comparison, with Amazon set up with fixed credits.

        [Chris Hollowell] 11:43:56
Yes. You know, from my experience, a lot of the cloud providers,

        [Chris Hollowell] 11:44:01
        They're not really guaranteeing a specific Cpu model.

        [Chris Hollowell] 11:44:05
It's sort of nebulous what CPU they provide.

        [Chris Hollowell] 11:44:09
So I mean, I guess the question is, you know... so you have 10,000 cores.

        [Fernando Harald Barreiro Megino] 11:44:21
They do not tell you exactly what the CPU model is, but some family.

        [Fernando Harald Barreiro Megino] 11:44:31
So, for example, I used the N2, and that is Cascade Lake or Ice Lake, I think. I'm not the CPU expert, but those are, for Google, the newer generations,

        [Fernando Harald Barreiro Megino] 11:44:44
and if you take the N1, you go to the older generations.

        [Fernando Harald Barreiro Megino] 11:44:46
So, yeah, you're more or less right:

        [Fernando Harald Barreiro Megino] 11:44:55
you don't know exactly what the CPU is; it's an approximation.

        [Steven Timm] 11:44:59
        Oh, you think hmm

        [Enrico Fermi Institute] 11:45:00
Do they not expose anything in the OS?

        [Enrico Fermi Institute] 11:45:05
        Okay.

        [Steven Timm] 11:45:06
Yeah, I had my student actually run benchmarks on most of the new Google instances this summer.

        [Steven Timm] 11:45:18
I have the numbers; we've got most of the Google specs available.

        [Steven Timm] 11:45:22
        I think we want some

        [Fernando Harald Barreiro Megino] 11:45:24
        I would be interested in having that

        [Chris Hollowell] 11:45:26
        right, right.

        [Steven Timm] 11:45:37
        okay.

        [Chris Hollowell] 11:45:39
I guess the issue there, though, is that since they're not guaranteeing any CPU model

        [Chris Hollowell] 11:45:44
in particular, that could change.
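
(On the question of what the OS exposes: a minimal sketch follows; the model string in /proc/cpuinfo is what a guest can see. Note that some providers deliberately report only a generic family string rather than the exact model, which is exactly the nebulousness being discussed.)

```python
# Sketch: read the host CPU model string from inside the guest OS.
def cpu_model(path: str = "/proc/cpuinfo") -> str:
    with open(path) as f:
        for line in f:
            if line.startswith("model name"):
                return line.split(":", 1)[1].strip()
    return "unknown"

print(cpu_model())  # may be generic, e.g. "Intel(R) Xeon(R) CPU @ 2.80GHz"
```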

        [Enrico Fermi Institute] 11:45:57
        Here we had a comment from Dirk

        [Dirk] 11:46:04
One was about the elasticity, which was already covered; so it seems to be possible, within limits.

        [Dirk] 11:46:12
But probably, if you have a 10,000 average, you can't run 120,000 for one month and nothing the rest of the year;

        [Enrico Fermi Institute] 11:46:16
        Thanks.

        [Dirk] 11:46:18
        That's probably not gonna fly. Let's see.

        [Dirk] 11:46:22
it seems to me. But the other one was... I mean, we talked about these pricing plots a bit.

        [Dirk] 11:46:28
I think I finally understood what that last bin means:

        [Dirk] 11:46:35
that's from within the subscription; that's from the running counter inside.

        [Dirk] 11:46:40
        So it's in some sense fake pricing, right?

        [Dirk] 11:46:42
Because you pay the subscription price, but they still tabulate what things cost.

        [Fernando Harald Barreiro Megino] 11:46:48
Yeah. So with the subscription, what they are doing is, all the time, filling up our credits,

        [Dirk] 11:47:01
        Quote unquote.

        [Dirk] 11:47:01
        Okay.

        [Ian Fisk] 11:47:11
Yep. Mine was a question, actually: it is interesting to see the various models between the two big cloud providers.

        [Ian Fisk] 11:47:19
Has anyone done an updated estimate of what it's actually costing us to host these things ourselves?

        [Ian Fisk] 11:47:24
Because I'm looking at these numbers, and I know the size of the facility that we run versus the cost of the hosting and the operations, and these numbers are dramatically higher than what we're paying.

        [Ian Fisk] 11:47:44
So I'm putting that in, right?

        [Ian Fisk] 11:47:49
Okay, I am including in that price the cost of the hosting:

        [Ian Fisk] 11:47:53
what it costs to rent the space, to power the machines, to buy the machines, to operate the machines, to administer the machines, and to support people using the machines.

        [Fernando Harald Barreiro Megino] 11:48:07
But then everything that is installed on top is not included in this.

        [Ian Fisk] 11:48:11
Oh, really? What do you mean, what's installed on top?

        [Fernando Harald Barreiro Megino] 11:48:16
        All the services that you are running

        [Ian Fisk] 11:48:17
I am including all of that; I mean services like the batch system, and so on. Including all of that too, putting in all of those things.

        [Ian Fisk] 11:48:28
So I'm including the 15-person staff that runs the place, plus the cost of hosting the facilities, plus the cost of operating the storage, external networking, etc.

        [Fernando Harald Barreiro Megino] 11:48:40
I mean, the list prices in particular, if you go on-demand...

        [Fernando Harald Barreiro Megino] 11:48:45
I've been told not to compare, but I compared it myself with a US ATLAS Tier-2, and if you use on-demand instances it's considerably higher. Our subscription agreement is very similar to a US ATLAS Tier-2, without saying how much a Tier-2 costs.

        [Ian Fisk] 11:49:01
        Right.

        [Fernando Harald Barreiro Megino] 11:49:17
Because it creates conflicts and fights.

        [Enrico Fermi Institute] 11:49:26
No, but the quality of service: if you're going to compare it to a Tier-2, right, is the quality of service that you have to provide for the storage

        [Enrico Fermi Institute] 11:49:38
the same on Google Cloud as it is for the Midwest Tier-2, for example?

        [Fernando Harald Barreiro Megino] 11:49:45
        I mean

        [Enrico Fermi Institute] 11:49:46
Because that has an operational effect.

        [Fernando Harald Barreiro Megino] 11:49:51
I mean, my opinion is that the quality of service in Google is something that... I mean, they simply have thousands of

        [Enrico Fermi Institute] 11:50:02
No, I mean the ATLAS services, the ATLAS services that are running on the site

        [Fernando Harald Barreiro Megino] 11:50:02
        So the quality of yeah.

        [Enrico Fermi Institute] 11:50:08
that runs at Google, right? Can it, for instance, be a nucleus, so that it can serve out data and stuff like that?

        [Enrico Fermi Institute] 11:50:13
That's what I mean by quality of service, not their underlying layer;

        [Enrico Fermi Institute] 11:50:16
that's all good. What I really mean is the WLCG layer of services and code that has to run on it to have it

        [Fernando Harald Barreiro Megino] 11:50:16
        So

        [Enrico Fermi Institute] 11:50:27
behave like a typical Tier-2 grid site.

        [Fernando Harald Barreiro Megino] 11:50:30
I've been working on the PanDA site, and that works as well as any ATLAS Tier-1 or Tier-2, I mean.

        [Fernando Harald Barreiro Megino] 11:50:40
I don't know if it's better, because I don't look that much

        [Fernando Harald Barreiro Megino] 11:50:43
at other sites. But it's completely flat, so there are never wasted cores;

        [Fernando Harald Barreiro Megino] 11:50:49
the failure rate is very, very low, and when there is a failure, it's usually...

        [Fernando Harald Barreiro Megino] 11:50:57
it's usually caused by my misconfiguration, just that.

        [Fernando Harald Barreiro Megino] 11:51:01
I don't have the... it's new; we've been running it for a month.

        [Fernando Harald Barreiro Megino] 11:51:07
And, for example, I underestimated the disk, things like that. Give this another half a year, and in my opinion the PanDA queue will run as well or better.

        [Ian Fisk] 11:51:21
I guess I would just go back to my point, which I think is important for the report to follow, which is that we're always in a situation where we're making a choice in terms of how we allocate the resources, and we're always having to cut back on something else to afford it. So

        [Ian Fisk] 11:51:33
in some sense, at some point we're gonna have to make an argument that says: using the cloud is less expensive by some metric.

        [Enrico Fermi Institute] 11:51:53
Kaushik has had his hand raised for a while; jump in.

        [Kaushik De] 11:51:57
Yeah, so I wanted to address two of the points that we have had extensive discussion on.

        [Kaushik De] 11:52:07
One is the elasticity: they don't seem to care.

        [Kaushik De] 11:52:11
They're perfectly fine if you want to use 100,000 cores for one month instead of 10,000 cores for the duration of the project. We are planning to test that when we move to the later parts of our planned program of work and R&D studies with Google. But we certainly

        [Kaushik De] 11:52:30
plan to test both models. The only reason why we started with the flat model is because that's what our current computing systems are designed for, and we wanted to give that a quick test.

        [Enrico Fermi Institute] 11:52:39
        Okay.

        [Kaushik De] 11:52:48
We don't have to continue this way: we could run nothing for 3 months, and then we could run 5 times higher for a month.

        [Kaushik De] 11:52:56
It's completely elastic, up to the limits of the resources of the data center.

        [Kaushik De] 11:53:03
        And then, of course, one can scale up by going to multiple data centers.

        [Kaushik De] 11:53:06
So that's the elasticity issue, even with the subscription model, because we have discussed it with them.

        [Kaushik De] 11:53:14
The cost-comparison issue, I think, is an important one, but I think we have to be a little bit careful, because we will never come to a conclusion if we ask Google, or the team that's using Google, to come up with the cost of

        [Kaushik De] 11:53:31
a Tier-1 or a Tier-2 site. I mean, that just will never work.

        [Kaushik De] 11:53:35
You know that it will never work, because every time somebody from the outside tries to evaluate the cost of

        [Enrico Fermi Institute] 11:53:38
        Me.

        [Kaushik De] 11:53:42
a Tier-1 or Tier-2 site, there will be something that people will find to say that it was not done correctly.

        [Kaushik De] 11:53:53
So I think it's the Tier-1 and Tier-2 sites who actually, truly, have to do the costing, and they actually have to do the comparison,

        [Kaushik De] 11:54:03
and they actually have to decide what is best for them: to have on-prem resources or off-prem resources,

        [Kaushik De] 11:54:09
        And in what particular combination do they want to do it?

        [Kaushik De] 11:54:12
I think it's up to the Tier-1 and Tier-2 sites;

        [Kaushik De] 11:54:16
it's not up to the people who are using Google and Amazon, and it's certainly not up to the salespeople from Google and Amazon, to tell us how they can do it cheaper. All

        [Kaushik De] 11:54:25
we can do, and I think that's what we are focused on doing,

        [Kaushik De] 11:54:28
and I think that's really what Fernando just updated in these two plots,

        [Kaushik De] 11:54:32
is: what is the cost of doing this and that on Google and Amazon?

        [Kaushik De] 11:54:39
And I think that's how we make progress: we are as transparent as possible, with as many different kinds of tests as possible;

        [Kaushik De] 11:54:48
we explore all the possibilities that we can.

        [Kaushik De] 11:54:54
And then, as experimentalists, we do that through this project over the next 15 months, and then we provide that information. Then it is up to the Tier-1 and Tier-

        [Kaushik De] 11:55:05
2 sites, and people of various kinds, to come and argue this way and that way; and as a technical project I don't think we should be part of that.

        [Enrico Fermi Institute] 11:55:15
No, but there is a caution there, about lost capabilities.

        [Enrico Fermi Institute] 11:55:23
And I'll use an example: a lot of the engineering was pulled out of the physics departments and went to the national labs, and university groups lost capabilities.

        [Enrico Fermi Institute] 11:55:34
        They couldn't do certain things on detector projects. This will be the same.

        [Enrico Fermi Institute] 11:55:39
We have to quantify that effect: if we were to, for instance, move all the compute to the cloud, what would we lose?

        [Kaushik De] 11:55:46
I completely agree with you, but those are not part of a

        [Kaushik De] 11:55:50
technical study of what we can do on Google and Amazon;

        [Kaushik De] 11:55:54
        Those are really discussions within the field of how we move our field forward.

        [Kaushik De] 11:55:59
I think we should separate the two; I don't think we should mix up the two.

        [Kaushik De] 11:56:02
        I think we should look at the quality of service. I think we should look at the type of service.

        [Kaushik De] 11:56:07
I think we should look at the services that can actually be provided, and we look at the cost.

        [Kaushik De] 11:56:17
That's the scope of what we're doing. Beyond that,

        [Kaushik De] 11:56:21
of course, it's up to the field to decide.

        [Enrico Fermi Institute] 11:56:24
But even in the technical cost study, because of the labor we provide to it,

        [Enrico Fermi Institute] 11:56:29
don't we also have to capture the labor needed to have the same quality of service, as seen from the experiment, as, say, a typical Tier-2?

        [Eric Lancon] 11:56:51
Yes, so I wanted to come back on a few statements which were made. I think we need to be very careful about general statements like "it's cheaper than a Tier-2". Those statements

        [Eric Lancon] 11:57:15
do not represent US ATLAS, and it should be indicated on the slides if there are such statements there. There is a working group within ATLAS which is being set up specifically to look at the TCO for operating on the cloud versus a Tier-2, so we may want

        [Eric Lancon] 11:57:33
to wait for the conclusions of this working group. What I would like to say is that we are

        [Eric Lancon] 11:57:42
very well aware of the cost of cloud compared to on-site operation, because for any big investment we perform a comparison of the costs,

        [Eric Lancon] 11:57:55
including on the cloud, including with the Google discount which is being used by ATLAS.

        [Eric Lancon] 11:58:05
And we have found, as it was noticed by Ian Fisk,

        [Eric Lancon] 11:58:12
the costs really prohibitive. I cannot give you exact numbers, because we cannot disclose the exact costs,

        [Eric Lancon] 11:58:23
but on-site is really much lower than any solution which is available on the cloud.

        [Fernando Harald Barreiro Megino] 11:58:43
Regarding your first comment: I didn't hear anyone saying that this is cheaper than a US Tier-2;

        [Fernando Harald Barreiro Megino] 11:58:52
I don't know where you got that. No, I said it gets similar with the subscription agreement.

        [Enrico Fermi Institute] 11:59:01
        Okay.

        [Fernando Harald Barreiro Megino] 11:59:05
Well, okay. In any case, explicitly, I didn't put a US Tier-2 cost on the slide. And for the TCO,

        [Fernando Harald Barreiro Megino] 11:59:13
it's what I said at the very beginning: there will be the TCO board.

        [Paolo Calafiura (he)] 11:59:27
Yeah, I want to... oh, sorry, me?

        [Paolo Calafiura (he)] 11:59:35
I didn't see the raised hands. Can I go?

        [Paolo Calafiura (he)] 11:59:39
I apologize. So, I want to make a comment, which is that one thing we have to keep in mind is the derivative.

        [Paolo Calafiura (he)] 11:59:50
So, the comparison of cloud costs with the cost on our own resources: the first time we did it, which was about 2016,

        [Paolo Calafiura (he)] 12:00:04
it was, like, an order of magnitude more expensive.

        [Paolo Calafiura (he)] 12:00:07
And while I agree with Eric that the cost comparison is not yet done,

        [Paolo Calafiura (he)] 12:00:14
now it's actually worth doing the cost comparison, and probably it will come out

        [Paolo Calafiura (he)] 12:00:20
still more expensive on the cloud side than on the owned side, but not by a factor of 10.

        [Paolo Calafiura (he)] 12:00:27
So I think one of the important roles of these investigations is to be ready, in case, for some reason, Google Cloud or AWS can buy CPU and storage at prices we don't

        [Paolo Calafiura (he)] 12:00:44
have access to. So let's not just see it as a short-term effort,

        [Paolo Calafiura (he)] 12:00:52
but as an effort which is thinking about what's going to happen in 5 years.

        [Eric Lancon] 12:00:55
No, I agree, Paolo. We should keep a close eye on the costs, and if, for an equivalent level of service, it's cheaper on the cloud, we should consider going to a cloud solution for some of the applications.

        [Steven Timm] 12:01:25
So I have a couple of comments. One is that

        [Steven Timm] 12:01:29
2-3 years ago, just before Covid, there was a very big study done at Fermilab:

        [Steven Timm] 12:01:35
what would it cost to run the Rubin data center here, as opposed to running it on the cloud?

        [Steven Timm] 12:01:42
And we tried to cost that all out. I do not know all the exact numbers there, but it was a very comprehensive study, and that's a data point

        [Steven Timm] 12:01:51
        Yeah, would I could. So we must be familiar with it already. New fees could probably get there for you.

        [Steven Timm] 12:02:00
        If anybody wants to add, let's come in here

        [Ian Fisk] 12:02:01
I think those numbers are public, if people want to see them.

        [Steven Timm] 12:02:05
        Okay? Yeah, huh? Right

        [Steven Timm] 12:02:13
Great. The other thing that we've noticed, from 6 years ago when we first did the big CMS

[Steven Timm] 12:02:21
demo on Amazon until now, is that spot pricing has gone up by about a factor of 2

[Steven Timm] 12:02:26
on Amazon. In 2016 you could bid 25% of the on-demand price;

[Steven Timm] 12:02:33
you can't do that anymore and get any cycles. That's of interest, I think.

[Steven Timm] 12:02:38
And then the third thing is costing: what does it cost to run it here as opposed to on the cloud? We've been through many estimates of that; some disagree by a factor of 4, and there is always going to be a difference. But sooner or later

[Steven Timm] 12:02:59
we're going to say: we need more money to put in more computing,

[Steven Timm] 12:03:03
we need more money for another building. And we're not going to get it. So there will be a limit to how much we can put on a site,

[Steven Timm] 12:03:10
and that may eventually be the driver for why we need to go to the cloud.

        [Enrico Fermi Institute] 12:03:22
        Okay, thanks, Keith. We'll go to Tony.

        [Fernando Harald Barreiro Megino] 12:03:34
This is 10,000... I mean, this cost covers all of the bullets except the last one.

[Fernando Harald Barreiro Megino] 12:03:41
It's 10,000 virtual CPUs where you run

[Fernando Harald Barreiro Megino] 12:03:44
whatever you want, 7 petabytes of standard object store, and 1.5 petabytes of egress per month.

[Fernando Harald Barreiro Megino] 12:03:52
It supports whatever you're using it for; it's not strictly related to simulation.

        [Enrico Fermi Institute] 12:04:03
Can I ask a question? Do you have the hooks in place, in PanDA,

[Enrico Fermi Institute] 12:04:09
to capture what CPU model the job reports? Because

[Enrico Fermi Institute] 12:04:16
then we can turn around and figure out, for the number of virtual CPUs you've used for some period of time,

[Enrico Fermi Institute] 12:04:22
what the HEP-SPEC06 equivalent is, and then one can compare it. At least we know what the Tier 2s provide in terms of HS06.

        [Fernando Harald Barreiro Megino] 12:04:38
I have not looked into that, but for most of the grid sites it gets reported back.

[Fernando Harald Barreiro Megino] 12:04:50
The pilot looks for that information and reports it back.

        [Enrico Fermi Institute] 12:04:52
        Okay.

        [Enrico Fermi Institute] 12:05:01
Then it might be very interesting to compare that to the benchmark jobs, taken with enough spread so that you get the distribution of what they're actually giving you. Because at the end of the day we get paid in US dollars per HS06.
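For illustration, here is a minimal Python sketch of the accounting idea discussed above: converting reported CPU models and vCPU-hours into approximate HS06-hours. The CPU model names and the HS06-per-core factors are invented placeholders, not measured benchmarks; real factors would come from running the benchmark on each instance type.

    # Minimal sketch: convert cloud vCPU-hours into approximate HS06-hours.
    # All per-core factors below are hypothetical placeholders.
    HS06_PER_CORE = {
        "Intel Xeon Platinum 8272CL": 10.0,  # placeholder value
        "AMD EPYC 7B12": 11.0,               # placeholder value
    }
    DEFAULT_HS06_PER_CORE = 9.0  # fallback for unrecognized CPU models

    def hs06_hours(jobs):
        """jobs: iterable of (cpu_model, n_vcpus, wall_hours) tuples,
        as they could be reported back by the pilot."""
        total = 0.0
        for cpu_model, n_vcpus, wall_hours in jobs:
            factor = HS06_PER_CORE.get(cpu_model, DEFAULT_HS06_PER_CORE)
            total += factor * n_vcpus * wall_hours
        return total

    # Example: two jobs on different instance types
    print(hs06_hours([("Intel Xeon Platinum 8272CL", 8, 12.0),
                      ("AMD EPYC 7B12", 16, 4.0)]))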

        [Enrico Fermi Institute] 12:05:31
        Comment from Ian

        [Ian Fisk] 12:05:34
I thought Steven was done, but I just wanted to go back and belabor

[Ian Fisk] 12:05:44
this point about cost. I think one thing that we need to assess as a field is what are the economics that have changed, since every time we do this evaluation we find that things are a little bit closer to being competitive, and at some point maybe they will make the transition over.

[Ian Fisk] 12:05:59
But something has to happen, which is basically that the economy of scale associated with AWS and Google has to be so large

[Ian Fisk] 12:06:09
that they can do the work cheaper than we can and still make money. Whether that's a co-location facility, or whether that's the fact that we use all our resources only a fraction of the time, or whatever, something has to change. Because at the end of the day, it's for the

[Ian Fisk] 12:06:24
same reason I don't drive a rental car to work:

[Ian Fisk] 12:06:28
if you have a facility which you're using all of the time and you operate it yourself,

[Ian Fisk] 12:06:33
it's very hard for someone to undercut you, unless

[Ian Fisk] 12:06:38
they're so large, or so cheap, or they build in cheaper places. We must be able to identify the thing that is going to make it competitive.

        [Gonzalo Merino] 12:06:58
Yeah, just a brief comment: I wanted to subscribe to a previous comment from Kaushik. I must say I'm a little bit surprised about all this discussion of whether it is cheaper or more expensive. WLCG has like 170 sites,

[Gonzalo Merino] 12:07:17
so the answer will be totally different for each of those sites.

[Gonzalo Merino] 12:07:20
So I think, and here I totally subscribe to Kaushik's point, the value in this exercise, or at least part of it, is: okay, we need to get these numbers, like the ones Fernando showed; that's super useful.

[Gonzalo Merino] 12:07:31
What's the cost of running this in a commercial cloud?

[Gonzalo Merino] 12:07:35
And then it is for each of those 170 sites to take this number and compare it to their internal costing, which will be completely different

[Gonzalo Merino] 12:07:43
depending on size, depending on country; the labor cost differs by factors

[Gonzalo Merino] 12:07:49
between countries. So discussing in the abstract whether it's more expensive or cheaper I think is useless. Whether it's Fermilab, or a Tier 2 here in Czechoslovakia,

[Gonzalo Merino] 12:08:01
or in Spain, it's for each of the sites in every country to compare this number to their own cost, which everybody knows,

[Gonzalo Merino] 12:08:10
and then react accordingly. I would say that's the value I see, like the example of the rental car.
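To make that per-site comparison concrete, a small hedged sketch: each site plugs its own internal numbers into something like the function below and compares against a quoted cloud rate. Every number here is an invented placeholder, not a real cost from any site.

    # Sketch of the per-site cost comparison described above.
    # All inputs are made-up placeholders for illustration only.
    def onprem_cost_per_hs06_hour(capex, lifetime_years,
                                  power_cooling_per_year, labor_per_year,
                                  delivered_hs06, utilization):
        yearly_cost = capex / lifetime_years + power_cooling_per_year + labor_per_year
        delivered_hours = delivered_hs06 * 8760 * utilization
        return yearly_cost / delivered_hours

    # Hypothetical Tier-2-like site ...
    site = onprem_cost_per_hs06_hour(capex=2_000_000, lifetime_years=4,
                                     power_cooling_per_year=150_000,
                                     labor_per_year=300_000,
                                     delivered_hs06=100_000, utilization=0.9)

    # ... versus a hypothetical cloud rate of $0.03 per vCPU-hour
    # at an assumed 10 HS06 per vCPU (both placeholders).
    cloud = 0.03 / 10.0
    print(f"on-prem: ${site:.4f}/HS06-h, cloud: ${cloud:.4f}/HS06-h")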

        [Enrico Fermi Institute] 12:08:37
Shigeki?

        [Shigeki] 12:08:41
Yeah, one comment I have is this: there's really no incentive for any of these cloud providers to become the lowest-cost provider. They're in the business to make money, right?

[Shigeki] 12:08:54
And they have hordes of accountants and supercomputers that are constantly hedging the cost of everything.

[Shigeki] 12:09:03
Their business model is value-add, not to drop to being the lowest-cost provider.

        [Shigeki] 12:09:11
        Right.

        [Dirk] 12:09:21
The power of competition; I mean, they are competing against each other.

        [Dirk] 12:09:29
        So

        [Enrico Fermi Institute] 12:09:34
        Yeah, yes.

        [Dirk] 12:09:36
I mean, isn't that the saying: in any mature market,

[Dirk] 12:09:39
the price of a service will basically go down so that the profit approaches zero?

        [Enrico Fermi Institute] 12:09:48
Have you flown recently? That's a mature market, and the prices are going the other way.

        [Dirk] 12:09:56
Well, demand versus supply. So that's the thing about these data centers.

        [Fernando Harald Barreiro Megino] 12:10:09
Okay, I think we can go then to the next slide, with Kenyi.

        [Kenyi Paolo Hurtado Anampa] 12:10:19
        okay. Yup

        [Kenyi Paolo Hurtado Anampa] 12:10:28
We'll take that as a no. So, okay, moving on to the CMS experience.

[Kenyi Paolo Hurtado Anampa] 12:10:33
I tried to summarize and put a few numbers there from these sources.

[Kenyi Paolo Hurtado Anampa] 12:10:40
This is from one paper and some slides of the work

[Kenyi Paolo Hurtado Anampa] 12:10:45
that was done 5-6 years ago on Amazon and Google Cloud.

[Kenyi Paolo Hurtado Anampa] 12:10:49
So again, these numbers are not up to date; they are from 2016-2017.

[Kenyi Paolo Hurtado Anampa] 12:10:57
Things have changed. But the high-level summary of the conclusion there is that the costs per core-hour for both AWS and Google Cloud were close to similar. The work on Amazon was run over the course of a few days, about 8 days, and you can see in

[Kenyi Paolo Hurtado Anampa] 12:11:20
the top right plot, in green, the production on AWS,

[Kenyi Paolo Hurtado Anampa] 12:11:29
against what was kept on the Fermilab side; and then the bottom plot is what we had from Google Cloud.

[Kenyi Paolo Hurtado Anampa] 12:11:42
The work on Google Cloud was done over the course of about 4 days;

[Kenyi Paolo Hurtado Anampa] 12:11:47
the goal was to double the size, in terms of total available cores,

[Kenyi Paolo Hurtado Anampa] 12:11:57
with respect to what we had in the global pool. The demo was done using

[Kenyi Paolo Hurtado Anampa] 12:12:04
standard production simulation workflows, and the on-premises

[Kenyi Paolo Hurtado Anampa] 12:12:11
estimate that I put on the slides is the estimate from the paper.

[Kenyi Paolo Hurtado Anampa] 12:12:18
You have the references linked from arXiv there.

[Kenyi Paolo Hurtado Anampa] 12:12:25
And then the other factor: let's focus on the operational effort.

[Kenyi Paolo Hurtado Anampa] 12:12:32
For this I got input from the HEPCloud team, and the conclusion is that there was initial effort, mostly related to monitoring.

[Kenyi Paolo Hurtado Anampa] 12:12:47
This was to prevent waste of compute resources: to track stuck jobs or jobs going too slow,

[Kenyi Paolo Hurtado Anampa] 12:12:59
to identify bad nodes, and to identify huge log files that caused concurrent high transfers.

[Kenyi Paolo Hurtado Anampa] 12:13:06
But after that, the ongoing maintenance is low in terms of effort, with an estimate of just one

[Kenyi Paolo Hurtado Anampa] 12:13:17
FTE for occasional needs. For example, the setup is still maintained up to today

[Kenyi Paolo Hurtado Anampa] 12:13:27
in case CMS wants to use it again, and a few months ago they worked on integrating support for ID tokens.

[Kenyi Paolo Hurtado Anampa] 12:13:49
Alright. Then we have the last slides with, basically, strategy considerations and discussion, and these are just some bullets.

[Kenyi Paolo Hurtado Anampa] 12:14:01
We talked a lot about cloud costs already, and there are some other bullets there related to egress costs, the role

[Kenyi Paolo Hurtado Anampa] 12:14:10
of the cloud in WLCG discussions, and how to make use of the cloud.
         

      • 10:40
        Strategic Considerations 20m

        Strategic considerations and Discussion

         

        [Fernando Harald Barreiro Megino] 12:14:38
We still have a little bit of time. On the cost,

[Fernando Harald Barreiro Megino] 12:14:46
people discussed it a lot already, but this is the opportunity to discuss any other worries, like for example egress costs or other worries about the cloud; or are there any particular ideas of how we can make better use of the cloud, like to exploit elasticity,

[Fernando Harald Barreiro Megino] 12:15:13
to use GPUs, or what was discussed with Lancium?

        [Dirk] 12:15:32
We already talked about elasticity, and I just wanted to maybe focus on one of the points on the slide:

[Dirk] 12:15:42
the different planning horizon versus our own equipment. That gives you kind of a different layer of elasticity, because when you purchase equipment, it's not only that you have a certain number of deployed cores in your data center; it's also that when you purchase

[Dirk] 12:15:58
the equipment you usually basically make a commitment for the next 3, 4, or 5 years,

[Dirk] 12:16:04
whatever the retirement window now is for hardware that you buy;

[Dirk] 12:16:08
it's gone up a bit. With cloud you don't have to make that commitment.

[Dirk] 12:16:15
Now, the thing is, though, in our science we usually have pretty stable workloads, so we can't really take full advantage of that.

[Dirk] 12:16:23
So usually we buy equipment for 4 years, and we expect, year to

[Dirk] 12:16:30
year, to always have the work to keep it busy. But looking out,

[Dirk] 12:16:35
there's the dip before the HL-LHC

[Dirk] 12:16:42
comes up. I don't know if that's something where cloud maybe could help.

[Dirk] 12:16:48
If at that point we were like 20% cloud, you could say: okay, for the off years, the shutdown years, you just don't buy any cloud cycles.

[Dirk] 12:16:59
I'm not sure how that would play with the subscription renewal, like if you are in a subscription model, whether you could just skip a renewal and then resume a year

[Dirk] 12:17:08
later. But that's a possibility, and you really don't have that with purchased equipment, because you kind of continuously keep buying equipment

[Dirk] 12:17:21
just to not have everything retired all at once.

[Dirk] 12:17:24
I mean, you kind of cycle over your whole data center.
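As a back-of-the-envelope illustration of that commitment difference, here is a tiny sketch with entirely invented prices and an invented demand profile; the point is only that purchased capacity is paid for regardless of a dip year, while a year-by-year cloud subscription can track demand.

    # Invented numbers: compare a fixed 4-year purchase against
    # year-by-year cloud subscriptions that can skip a dip year.
    CORES = 4000
    OWNED_COST_PER_CORE_4YR = 400.0   # hypothetical purchase + power, 4-year life
    CLOUD_COST_PER_CORE_YR = 150.0    # hypothetical yearly subscription rate

    # Fraction of the farm actually needed in each of 4 years
    # (for example, a dip during a shutdown year before the HL-LHC).
    demand = [1.0, 1.0, 0.5, 1.0]

    owned_total = CORES * OWNED_COST_PER_CORE_4YR          # paid regardless of use
    cloud_total = sum(CORES * f * CLOUD_COST_PER_CORE_YR for f in demand)

    print(f"owned: {owned_total:,.0f}  cloud: {cloud_total:,.0f}")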

        [alexei klimentov] 12:17:41
I think this is a very simplistic approach.

        [Enrico Fermi Institute] 12:17:41
Go ahead, Alexei.

        [alexei klimentov] 12:17:48
I think what at least we are trying to do in ATLAS:

[alexei klimentov] 12:17:53
we are trying to integrate clouds into our computing model, and it is not how it was just described. I want to remind you that one of the first uses of clouds, at least which I remember, was done by the Belle experiment,

[alexei klimentov] 12:18:14
not by Belle II, but by Belle, when they needed to conduct a Monte Carlo campaign, and the way they designed it,

[alexei klimentov] 12:18:23
for them it was cheaper just to buy cycles and run this Monte Carlo campaign.

[alexei klimentov] 12:18:32
So I think, regarding just this comparison, and also what was mentioned before by several people:

[alexei klimentov] 12:18:40
is it a replacement of what we have with our own resources?

[alexei klimentov] 12:18:44
Of course not, it is not a replacement; but it is resources which we can use, and elasticity for me

[alexei klimentov] 12:18:52
is one of the main features which we can use. And as Paolo mentioned, before we go to purchase something new that we don't have now, we can try it in the cloud.

[alexei klimentov] 12:19:07
I also kind of disagree with the statement that our workflows are

[alexei klimentov] 12:19:12
very standard, or whatever word you use, because what we see even now, and I think it will

[alexei klimentov] 12:19:19
continue in this direction, is that we have new, more complex workflows, which at least in ATLAS

[alexei klimentov] 12:19:29
we did not have during Run 2, and for high luminosity it will be more and more like that.

[alexei klimentov] 12:19:33
So that's why I think the problem is more complex, and we need to address it in a more holistic way, and not, what I'm afraid of,

[alexei klimentov] 12:19:45
start to split it into small pieces, because then, well, we all know.

        [Eric Lancon] 12:20:02
Yes, sorry. I do agree with Alexei that

[Eric Lancon] 12:20:09
there are more complex workflows coming, and there is a need to adapt.

[Eric Lancon] 12:20:13
What I don't fully follow is the conclusion that the cloud is most suited for this;

[Eric Lancon] 12:20:22
the facilities need to work to adapt to the new requirements,

[Eric Lancon] 12:20:28
and that's what makes the comparison at the end.

        [alexei klimentov] 12:20:46
If you followed my comment, I fully agree with you, and that's why

[alexei klimentov] 12:20:52
what we will try in PanDA is the full chain; for me

[alexei klimentov] 12:21:00
it is the bigger test, and the first days are also to try it.

        [Enrico Fermi Institute] 12:21:29
So one comment I had: we've spent a lot of time talking about how the clouds hook into the existing workflow systems, PanDA and whatnot.

[Enrico Fermi Institute] 12:21:39
Does it make sense to further explore how clouds can either be used as analysis facilities, or extend analysis facilities in some way? One of the things that users might want, for example, are exotic

[Enrico Fermi Institute] 12:22:01
kinds of resources, or accelerators, GPUs, things like that.

[Enrico Fermi Institute] 12:22:05
Can we use clouds to sort of pad out those kinds of resources at analysis facilities? Does it make sense to explore that?

        [Fernando Harald Barreiro Megino] 12:22:14
So in all of the projects there is always the possibility for the user to get an account and do whatever they need.

[Fernando Harald Barreiro Megino] 12:22:31
If it's more of a central analysis facility,

[Fernando Harald Barreiro Megino] 12:22:39
the analysis facilities that we are usually talking about in ATLAS or CMS,

[Fernando Harald Barreiro Megino] 12:22:46
for that there will also be, in the ATLAS project, R&D

[Fernando Harald Barreiro Megino] 12:22:51
to extend that, and some ideas to do that were presented in the last week or two.

        [Enrico Fermi Institute] 12:23:14
Is my mic working? So, something that was interesting, and I don't have it right at my fingertips:

[Enrico Fermi Institute] 12:23:20
Purdue University actually got a pretty big grant from Google to set up a system where basically their batch system can burst into the Google cloud. They also have all the VPNs and whatnot set up, and the images are the same images as their

[Enrico Fermi Institute] 12:23:42
compute farm's, and with the VPN setting up the networking and whatnot, the remote cloud hardware is, quote, the same as the regular batch they have there. So outside of latency or whatever, you can basically just

[Enrico Fermi Institute] 12:23:57
submit Condor, or I think they run Slurm there, you could submit Slurm jobs and run whatever you want. So there's definitely work that's been done.

        [Enrico Fermi Institute] 12:24:32
So maybe to bring up another topic from yesterday: we mentioned here a little bit about using the cloud to run some particular campaign or what have you. Does that have any effect on how we think about pledging clouds?

[Enrico Fermi Institute] 12:24:53
And in general, are there any discussions we want to have about pledging clouds?

[Enrico Fermi Institute] 12:25:04
Dirk, you want to jump in?

        [Dirk] 12:25:06
Yeah, I think the cloud fits into the discussion we had yesterday about pledging.

[Dirk] 12:25:15
I think, under the current rules, to pledge a cloud you would have to pledge a certain minimum amount

[Dirk] 12:25:22
of cores. So if you replicate a site where your business is to always keep 4,000 cores running,

        [Enrico Fermi Institute] 12:25:23
        Yeah.

        [Dirk] 12:25:29
you could pledge the 4,000 cores, but you couldn't really take advantage of elasticity.

        [Dirk] 12:25:35
So you kind of would have to pledge the lower boundary, within some limits, because even grid sites are allowed to go below the floor for a limited amount of time, I think. But it puts limits on how flexibly you can use the

[Dirk] 12:25:53
resources; the same problem we have with the scheduling on the HPCs,

[Dirk] 12:25:57
that you basically can't just keep it off for 11 months of the year and then use up everything in a month. That wouldn't work with how the pledges are structured right

[Dirk] 12:26:08
now, and what the rules are.

        [Enrico Fermi Institute] 12:26:09
We pledge HS06, not cores.

        [Enrico Fermi Institute] 12:26:21
But the point is that we have to figure out,

[Enrico Fermi Institute] 12:26:27
if you're going to even consider pledging cloud resources, how to put it in a unit that is consistent with what we have, so it's apples to apples.

        [Steven Timm] 12:26:59
        Yes, I was going back to the question of exotic resources.

        [Steven Timm] 12:27:04
As I commented yesterday, the exotic resources, such as the P machines on Amazon, the FPGAs and the tensor things or whatever, are always the highest-priced things you can get. But you still have to weigh that against having them sit on

[Steven Timm] 12:27:21
site, on premises, sitting there and sucking up power all the time

[Steven Timm] 12:27:25
and not being used all the time; at least we don't yet have a steady

[Steven Timm] 12:27:31
use case for GPUs or TensorFlow or FPGAs, or whatever.

[Steven Timm] 12:27:37
So there is value there, and I've heard from management that they prefer that.

        [Bockelman, Brian] 12:28:08
Yeah, I just wanted to maybe tackle

[Bockelman, Brian] 12:28:13
what Dirk said a little differently. I'm worried less about the HEP-SPEC

[Bockelman, Brian] 12:28:20
06 equivalent than about the fact that for cloud resources you probably need to pledge in HEP-SPEC06-hours, right?

[Bockelman, Brian] 12:28:30
It's the difference between kilowatts versus kilowatt-hours; some aspect of the pledge,

[Bockelman, Brian] 12:28:39
again going to the power-grid analogy, needs to be in kilowatt-hours.

[Bockelman, Brian] 12:28:45
What the benchmark is, I think, is less important.

[Bockelman, Brian] 12:28:49
But how do you come up with a proposal that balances the fact that you do need some base capacity, and that's important,

[Bockelman, Brian] 12:28:59
with the fact that it's very unlikely that 100% of our hours need to be base capacity?

[Bockelman, Brian] 12:29:06
So, some combination of kilowatts and kilowatt-hours, or their analogies, in our pledges.
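A minimal sketch of that kilowatt versus kilowatt-hour distinction, with invented numbers: a flat site and a bursty cloud profile can deliver the same integrated HS06-hours, while only the flat one satisfies a pure capacity pledge.

    # Invented profiles: monthly HS06 capacity over one year.
    HOURS_PER_MONTH = 730
    flat = [40_000] * 12            # steady site: 40 kHS06 all year

    bursty = [10_000] * 12          # low floor for 11 months ...
    bursty[0] = 370_000             # ... plus one big burst in January

    def integrated_hs06_hours(profile):
        return sum(capacity * HOURS_PER_MONTH for capacity in profile)

    # Both deliver the same integrated HS06-hours (kilowatt-hour analogue) ...
    print(integrated_hs06_hours(flat) == integrated_hs06_hours(bursty))
    # ... but only the flat profile meets a 40 kHS06 capacity pledge
    # (kilowatt analogue) at all times.
    print(min(flat) >= 40_000, min(bursty) >= 40_000)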

        [Johannes Elmsheuser] 12:29:20
Right, a follow-up comment to this: at the end the pledges are always, as you say, a unit per year, right?

[Johannes Elmsheuser] 12:29:31
And we don't have a unique CPU architecture either, right?

[Johannes Elmsheuser] 12:29:37
Over the years, with all the procurements, there are always different kinds of CPU architectures.

[Johannes Elmsheuser] 12:29:47
So, as was said before, we have more or less the same problem also on the grid.

[Johannes Elmsheuser] 12:29:57
We are also averaging there, so we don't have the same unit over and over at the same site.

[Johannes Elmsheuser] 12:30:03
So in principle on the cloud we are solving the same problem here.

[Johannes Elmsheuser] 12:30:08
So I don't really see this as problematic in that sense, because we have done exactly the same thing for the past 10-15 years on the grid.

        [Bockelman, Brian] 12:30:18
Yep, I don't think I'm following, because what we pledge on the grid is a certain HEP-

[Bockelman, Brian] 12:30:27
SPEC06 capacity that is available starting at a given time period.

[Bockelman, Brian] 12:30:33
Right, but it's...

        [Johannes Elmsheuser] 12:30:34
Right? And that's for one year, right?

[Johannes Elmsheuser] 12:30:40
It's good for one year, and at the site you don't have a specific unit of one CPU, right?

[Johannes Elmsheuser] 12:30:47
You always have an average, and that was the argument before.

        [Bockelman, Brian] 12:30:50
        Oh!

        [Bockelman, Brian] 12:30:55
Hmm, no, no. But that's very different. It's not the average, right? Because

[Bockelman, Brian] 12:31:01
I can't come in and give you 12 times as much capacity

[Bockelman, Brian] 12:31:03
in January, and zero it out for the next 11 months.

[Bockelman, Brian] 12:31:07
That is most definitely not what the MoUs say.

[Bockelman, Brian] 12:31:12
It's a very specific HEP-SPEC06 count, available, depending on whether you're a Tier 1 or a Tier 2,

[Bockelman, Brian] 12:31:19
I forget what the number is, 85% or 95% of the time.

        [Ian Fisk] 12:31:27
        right.

        [Johannes Elmsheuser] 12:31:28
Sure, but I agree that you basically give an average over a certain time period.

[Johannes Elmsheuser] 12:31:34
I think we agree here, and as you say, we then have to say: okay, you provided this

[Johannes Elmsheuser] 12:31:41
for 4 months, or for 3 months, or something like this, and this is then the pledge.

        [Ian Fisk] 12:31:49
No, I guess I'd also like to argue that our pledging model, as it is right now, is probably not ideal for that. We have a model which is based on the fact that we have dedicated facilities that have been purchased, and the experiments' responsibility is to

[Ian Fisk] 12:32:04
demonstrate that over the course of 12 months they can use them at some average rate; we both provision and schedule for average utilization. Whether it's HPC

[Ian Fisk] 12:32:14
or whether it's clouds, there's an opportunity to not do that, and we might find as collaborations that the ability to schedule 5 times more for some period of a month, and leave it alone for the rest of the year, was actually a much more efficient use of people's

[Ian Fisk] 12:32:32
time, and that our current pledging model is sort of limiting.

[Ian Fisk] 12:32:36
I believe Maria Girone, who's connected, presented this at CHEP in Osaka,

[Ian Fisk] 12:32:41
probably 6 years ago: the concept of scheduling for peak. And it seems like, because we have dedicated resources, we have to show that they're well used.

        [Dirk] 12:33:25
Yeah, and maybe one complication with scheduling for peak:

[Dirk] 12:33:30
you actually have to think about and justify what you want to use the peak for.

[Dirk] 12:33:36
So it's more complicated to plan this; in steady state you just keep it busy.

        [Ian Fisk] 12:33:39
It is more complicated, right; it's more complicated to plan.

[Ian Fisk] 12:33:44
It requires people to be better prepared.

        [Dirk] 12:33:47
        Yeah. But that's maybe why it hasn't happened yet.

        [Ian Fisk] 12:33:49
Right, but at the same time it would allow... imagine that a 6-month Monte Carlo campaign became a one-month Monte Carlo campaign, and then people

[Ian Fisk] 12:33:58
spent 5 months with the complete set for analysis; that might be much more efficient.

[Ian Fisk] 12:34:04
And that's also, I think, a motivation for why you might want to go to clouds, even if they were on paper more expensive: you'd have to make some metric of how much of people's time you're saving.

        [Enrico Fermi Institute] 12:34:17
Whose time are you saying you're saving?

        [Ian Fisk] 12:34:22
I would claim the entire collaboration's time to physics, perhaps.

        [Enrico Fermi Institute] 12:34:23
Which people? Which people's time?

        [Enrico Fermi Institute] 12:34:34
        How do you accurately measure without drawing a false conclusion?

        [Ian Fisk] 12:34:40
I don't... I think it's difficult.

[Ian Fisk] 12:34:42
I think it's probably somewhat difficult to measure the inefficiency that we have right now, but I think you can.

        [Enrico Fermi Institute] 12:34:48
        Okay.

        [Ian Fisk] 12:34:49
I think, without drawing a false conclusion, I can claim that the way it's set up right now is designed to optimize a specific thing, which is the utilization of particular dedicated resources.

        [Ian Fisk] 12:35:14
And I guess I'm claiming that's not the whole story.

[Ian Fisk] 12:35:18
If I assume that's the most important thing, because we spent all this money buying dedicated computers,

[Ian Fisk] 12:35:23
yeah, that's a reasonable thing to say: we're not going to let these things sit idle,

[Ian Fisk] 12:35:27
we're not going to over-provision. But it's very difficult to claim that the optimization designed to maximize use of this particular resource happens to also be exactly the perfect optimization

[Ian Fisk] 12:35:40
for these other kinds of metrics, like time to physics.

        [Dirk] 12:35:56
Efficient use of resources; I mean, that's the one thing with cloud.

[Dirk] 12:36:02
That's the one main difference I see: you buy resources,

[Dirk] 12:36:07
you have them sitting on your floor, you might as well use them, because it's already paid for.

[Dirk] 12:36:10
So at that point use doesn't cost much extra, okay, energy costs, whatever;

[Dirk] 12:36:14
but you kind of have to keep it busy. With HPC

[Dirk] 12:36:16
and cloud, you kind of have to justify the use, because you're more elastic.

[Dirk] 12:36:19
You get the allocation, and especially with cloud you want to make use of flexible, elastic scheduling.

[Dirk] 12:36:28
So at that point you have to justify each use, so it's more complicated to do that.

[Dirk] 12:36:34
But hopefully, if you do it right, you get a more efficient use of resources out of it.

        [Enrico Fermi Institute] 12:36:43
But how do you measure that?

        [Dirk] 12:36:46
I don't know.

        [Enrico Fermi Institute] 12:36:50
Because think of it this way: say it's a 10% cut of what we're doing now, and let's say that 10% diverts to the cloud. Then you have to see if that 10% diversion would give you more bang for the buck.

        [Ian Fisk] 12:37:21
Well, we actually did look at this in a concrete way for disaster

[Ian Fisk] 12:37:28
recovery. The scenario is: I've messed up my reconstruction, I need to reprocess things, and I only have a month; what would it cost?

[Ian Fisk] 12:37:39
Is there a model which says there's a reasonable insurance policy, which says I'm going to use the cloud for that kind of thing?

[Ian Fisk] 12:37:45
And so in some sense you can make arguments for where this is valuable in very specific situations, like when there's been a problem.

        [Johannes Elmsheuser] 12:38:25
I have a completely different comment, or a question, on the third point you have here, the bullet point on data

[Johannes Elmsheuser] 12:38:32
safeguarding. Is this something of concern or not?

[Johannes Elmsheuser] 12:38:40
Do we just say that the team basically has to safeguard our data against users who are repeatedly downloading it,

[Johannes Elmsheuser] 12:38:54
and then we are safe? Or is there something more behind it?

        [Fernando Harald Barreiro Megino] 12:38:56
What?

        [Johannes Elmsheuser] 12:38:59
Is there something else behind this data-safeguarding keyword

[Johannes Elmsheuser] 12:39:02
here?

        [Fernando Harald Barreiro Megino] 12:39:03
Well, that's a comment that sometimes I hear: that you don't want to have, like...

        [Johannes Elmsheuser] 12:39:27
Okay, right. So that's the computing-model question: would you have, so to say, another unique copy of your raw data,

[Johannes Elmsheuser] 12:39:40
for example, in the cloud. That would be behind that.

        [Fernando Harald Barreiro Megino] 12:39:43
Yeah. So, what is the overall role: can a cloud be a nucleus?

[Fernando Harald Barreiro Megino] 12:39:50
Or can the cloud only be treated as temporary storage?

[Fernando Harald Barreiro Megino] 12:40:00
The point is to let people express any worries regarding this.

        [Ian Fisk] 12:40:12
I guess I would like to express a worry regarding that, which is that I don't think any reasonable funding agency is going to let you make a custodial copy of the data in the cloud, because there's no guarantee that they don't change the rates to become

[Ian Fisk] 12:40:28
prohibitively expensive to move things out, or prohibitively expensive to move things in.

[Ian Fisk] 12:40:33
In the same way that the agency won't let you sign a 10-year lease on a fiber without tremendous amounts of negotiation,

[Ian Fisk] 12:40:40
they're not going to allow you to make a commitment in perpetuity for data storage.

[Ian Fisk] 12:40:44
So I think that almost by definition puts the clouds in a very particular place in terms of storage and processing: things that are transient, and things that can be re-created at the end of the job, because otherwise you're in a difficult situation.

        [Kaushik De] 12:41:16
Yeah, coming back to the question of how to make the most out of the cloud:

[Kaushik De] 12:41:20
one of the things that we have heard a lot about over the past many years are the AI/ML tools and capabilities and ecosystem on the cloud. Is that something we should continue to pursue? Is that something that should be added to the list, in terms of: are we missing out on something?

        [Enrico Fermi Institute] 12:41:33
        Okay.

        [Kaushik De] 12:41:47
Or is that something that we think we know how to do better with our own tools?

        [Dirk] 12:41:55
There is a session in the afternoon actually on R&D,

[Dirk] 12:41:58
specifically on machine learning training, and we actually have an invited talk from CERN.

[Dirk] 12:42:04
I think it's

[Dirk] 12:42:07
training on HPC, but it's similar; I mean, it applies to both HPC and cloud.

        [Enrico Fermi Institute] 12:42:21
It's also the case that the clouds do have some proprietary exotic cards, right, that aren't available to the general public, that are really meant for machine learning applications.

        [Dirk] 12:42:37
Yeah, but the bigger question is then what role machine learning will play in

[Dirk] 12:42:46
our computing operations going forward. And I don't know that we have the answer;

[Dirk] 12:42:50
neither CMS nor ATLAS has the final answer on that.

[Dirk] 12:42:53
So it's a bit hard to say: this is the way to go.

        [Kaushik De] 12:43:02
I mean, the one thing is, yeah, I think we

[Kaushik De] 12:43:11
have been trailblazers in many, many areas; but when it comes to the production use of AI/ML, the everyday use of AI/ML,

[Kaushik De] 12:43:26
I think the cloud and business systems do so much of it.

[Kaushik De] 12:43:34
How do we pull that in and access it?

[Kaushik De] 12:43:40
And I'm not just being paranoid, but to me this matters for production-level activities, because I notice that almost anything Google does nowadays, from their own products like Maps to the services they provide, is really heavily

[Kaushik De] 12:44:08
dominated by AI/ML; it's almost exclusively AI/ML. But are we using it?

        [Dirk] 12:44:21
Let me maybe make a comment, because Kenyi

[Dirk] 12:44:25
yesterday showed a use case from CMS where they basically ran a MiniAOD production, which is: you take the AOD, which is a larger analysis format, slim it down, and do some recomputation

[Dirk] 12:44:37
to get to a MiniAOD, which is smaller and actually useful for

[Dirk] 12:44:40
analysis. And they are pushing for a model where the

[Dirk] 12:44:47
algorithm does use machine learning, but during the production phase you run only the inference server; you're not actually running the learning.

[Dirk] 12:44:55
And that, for me, is the bigger question:

[Dirk] 12:44:58
because if you do a one-time shot, where you run your learning algorithms on a bunch of data that we have,

[Dirk] 12:45:04
figure out what you want to do, and then only run the inference

[Dirk] 12:45:08
during the heavy-lifting reconstruction or whatever else you do, then I'm not sure to what extent this really impacts the overall computing operations.

        [Kaushik De] 12:45:32
Yeah. And another aspect of this is that elasticity comes in when you talk about training; I mean, unless you go to continuous training models, which people are trying to do.

        [Dirk] 12:45:57
For these large training runs, how much capacity

[Dirk] 12:46:03
are we really talking about? Is that making an impact on our overall compute resource use?

        [Kaushik De] 12:46:28
Yeah, and we already run inference as a service in that context.

        [Dirk] 12:46:28
        Okay, So that.

        [Ian Fisk] 12:46:43
I think that's probably one of the ideal applications, primarily for HPC,

        [Dirk] 12:46:46
        Yeah.

        [Ian Fisk] 12:46:48
because they already have that kind of hardware.

        [Dirk] 12:47:03
The one thing, though, with this kind of application, and we will make a comment on it in the report, is that

[Dirk] 12:47:10
by design it kind of happens outside the current production systems and infrastructure. So it's kind of standalone, and I'm not sure to what extent it's really in scope

[Dirk] 12:47:22
for the report.

        [Ian Fisk] 12:47:22
I think this is one of the places where the concept of scheduling for peak comes into play, because as you go to more machine learning things that require training and hyperparameter tuning before you start running, you change when the computing is spent: you spend the computing beforehand, and

        [Dirk] 12:47:37
        Yes.

        [Ian Fisk] 12:47:39
then it's much faster on things like inference. And so it is a place where the model that says we're going to use them all at a steady rate doesn't quite fit.

        [Dirk] 12:47:56
And also, I mean, that's even where I see a mismatch,

[Dirk] 12:48:01
exploring the thinking about pledging such resources: if you assume that this resource use is significant, you want to be able to pledge it.

        [Enrico Fermi Institute] 12:48:14
        Okay.

        [Dirk] 12:48:15
But it's a single-purpose pledge, which is completely outside the scope of what pledging currently is.

[Dirk] 12:48:22
But you want to get some kind of credit for such a use case, so that's even harder than what we discussed so far, which is basically just adjusting the pledging to be more

[Dirk] 12:48:37
like a time-integrated value, not just the instantaneous one; the AC

        [Ian Fisk] 12:48:41
Right, and the kind of resources we're talking about here are the most expensive things we have.

        [Dirk] 12:48:41
versus DC argument.

        [Enrico Fermi Institute] 12:48:54
So maybe that needs to be written in the final report, so that the idea to push for flexibility gets across.

[Enrico Fermi Institute] 12:49:12
Because it is a different thing: for the training you really do want to use hardware that's designed for it; it works so much better.

[Enrico Fermi Institute] 12:49:22
Which makes it special, because it's specialized compared to what our code stack uses.

        [Dirk] 12:49:40
I mean, we're trying that, too. This is

[Dirk] 12:49:44
an active area of R&D, trying different approaches. I mean, in CMS we have the HLT,

[Dirk] 12:49:50
where tracking basically runs on GPU, and that sees a pretty significant speedup.

        [Steven Timm] 12:50:13
A question, not just for you guys with Lancium, but also for some of the other more exotic resources, and even more probably on the HPCs and the LCF

[Steven Timm] 12:50:23
systems: there are opportunities for things that can opportunistically go and grab a couple of hours of compute and come back with useful stuff.

[Steven Timm] 12:50:36
You may want to think about whether there is a workload redesign that has to happen to best exploit those kinds of resources, because on some of them,

[Steven Timm] 12:50:52
if you're preempted you lose everything; basically, you've been running for 10 hours and you had 2 to go, or something like that.

[Steven Timm] 12:50:58
We hit, for instance, that you could only get a 24-hour job length if you submitted at least 1,000 jobs,

[Steven Timm] 12:51:08
something like that. I don't have any answers for that, but it's something you should keep in mind when you're planning for non-conventional resources,

[Steven Timm] 12:51:20
to make sure you can get more stuff done.

        [Dirk] 12:51:23
I think that's one of the differences between the approaches targeting HPC. But that mostly affects HPC, because cloud just allows you to schedule whatever you're paying for.

[Dirk] 12:51:35
So they don't...

        [Steven Timm] 12:51:38
Well, Lancium can go down at any time, right?

        [Dirk] 12:51:40
They can; but in practice, if they went down every 30 minutes, it would probably become unusable for us. So we kind of rely on the fact that, even though in principle it can go down every 30 minutes, it doesn't actually happen all that often, and we cover

[Dirk] 12:52:00
the rest by making it an efficiency problem. Basically, our failure-handling code and our software

[Dirk] 12:52:06
stack can deal with it, and it just becomes an efficiency issue that goes into the cost

[Dirk] 12:52:10
calculation. I think if it gets more complicated than that, it becomes really problematic to use the resources. I know that ATLAS has the Harvester model; in principle you can survive,

[Dirk] 12:52:23
you can make use of very short time windows.

[Dirk] 12:52:28
But we don't have that in CMS, and I'm not sure how effective that is for ATLAS either.
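As a sketch of how preemption turns into an efficiency number in such a cost calculation, assume preemptions arrive at random (Poisson) and a preempted job loses everything and restarts; the preemption rate and job lengths below are invented for illustration.

    import math

    def survival_probability(job_hours, mean_hours_to_preemption):
        """Probability that a job of the given length finishes before
        being preempted, for Poisson-distributed preemptions."""
        return math.exp(-job_hours / mean_hours_to_preemption)

    # Invented example: preemptions on average once per 24 hours.
    for job in (1, 4, 10, 24):
        p = survival_probability(job, 24.0)
        print(f"{job:>2} h job: finishes {100 * p:.0f}% of attempts, "
              f"~{1 / p:.1f} attempts per success")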

        [Fernando Harald Barreiro Megino] 12:52:46
Dirk, Kenyi: what do you think, should we close this session?

        [Dirk] 12:52:56
Yeah, I mean, it's less than 10 minutes.

[Dirk] 12:52:59
There was some talk about maybe pulling one of the talks earlier, but that's not enough time, and it would probably trigger discussion.

        [Enrico Fermi Institute] 12:53:00
        The

        [Dirk] 12:53:07
So we can go with it first in the next session.

        [Enrico Fermi Institute] 12:53:11
Yeah, I think the discussions we've been having in the last 10 or 15 minutes lead nicely into the R

[Enrico Fermi Institute] 12:53:17
and D presentation.

        [Enrico Fermi Institute] 12:53:25
Maybe we break here, unless anybody has any other cloud topics that they want to bring up.

        [Enrico Fermi Institute] 12:53:30
I think this is the last session that's focused exclusively on cloud.

        [Enrico Fermi Institute] 12:53:37
Yeah, in the next session we'll talk about some R

[Enrico Fermi Institute] 12:53:43
and D things, and networking.

        [Enrico Fermi Institute] 12:53:53
Okay, so maybe we break here and we'll see everybody at one o'clock.
         

      • 11:00
        Discussion 1h
    • 12:00 13:00
      Lunch Break 1h
    • 12:35 15:00
      Second Day Afternoon

      [Eastern Time]

       



      619
      14:00:58,540 --> 14:01:00,130
      Fernando Harald Barreiro Megino: so you know

      620
      14:01:01,310 --> 14:01:02,620
Enrico Fermi Institute: machine learning,

      621
      14:01:03,820 --> 14:01:09,699
      Enrico Fermi Institute: and then we'll we'll go back to the the topics as presented in the slides,

      622
      14:01:10,610 --> 14:01:12,850
      Enrico Fermi Institute: so we'll just get started in a few minutes here,

      623
      14:01:53,380 --> 14:01:56,370
Maria Girone: so it's Eric starting first, right?

      624
      14:01:56,540 --> 14:02:04,520
Maria Girone: Yeah, if Eric is ready to present, we thought maybe it would be best to just have him go first.

      625
      14:02:07,990 --> 14:02:16,680
Enrico Fermi Institute: It's getting a little bit late, we're concerned. Yeah, exactly, we want to be considerate of people's time in Europe especially. Thank you.

      626
      14:02:29,590 --> 14:02:39,579
Enrico Fermi Institute: So just give it like two more minutes, and then, Eric, whenever you're ready, put your slides up. I'll stop sharing here when we get started shortly.

      627
      14:02:42,740 --> 14:02:47,450
Eric Wulff: Sounds good. I'm ready whenever, so just let me know. Okay.

      628
      14:02:48,630 --> 14:02:49,570
      you

      629
      14:02:54,390 --> 14:02:55,219
      The

      630
      14:03:09,350 --> 14:03:17,650
Enrico Fermi Institute: It seems like the rate at which people are rejoining has slowed down significantly, so I think you can go ahead and start.

      631
      14:03:22,080 --> 14:03:23,529
      Eric Wulff: uh, Okay.

      632
      14:03:24,610 --> 14:03:25,870
      Eric Wulff: So

      633
      14:03:27,290 --> 14:03:31,050
Eric Wulff: I'm sharing now, I think. Can you see?

      634
      14:03:31,340 --> 14:03:33,999
      Eric Wulff: Yes, it looks good. Okay, great.

      635
      14:03:34,560 --> 14:03:37,929
Eric Wulff: So I just have a

      636
      14:03:38,180 --> 14:03:52,689
Eric Wulff: two or three slides here, so it's a very short presentation, just to talk a little bit about what we have been doing regarding distributed training and hypertuning of deep-learning-based algorithms using HPC resources.

      637
      14:03:53,360 --> 14:04:00,499
Eric Wulff: This is something that I have been doing in the context of an EU-funded research project called CoE RAISE.

      638
      14:04:06,260 --> 14:04:08,620
Eric Wulff: Maria is involved in this, and she's my supervisor.

      639
      14:04:09,580 --> 14:04:10,969
      Um.

      640
      14:04:12,850 --> 14:04:15,450
      So let's see if I can change slide.

      641
      14:04:15,770 --> 14:04:17,940
      Eric Wulff: Yes, um.

      642
      14:04:18,590 --> 14:04:24,429
Eric Wulff: So, in case you're not aware of hyperparameter optimization:

      643
      14:04:25,320 --> 14:04:35,079
Eric Wulff: I've tried to summarize it very quickly here in just one slide. I will sometimes refer to it as hypertuning,

      644
      14:04:35,140 --> 14:04:36,670
      Eric Wulff: and um,

      645
      14:04:36,730 --> 14:04:39,300
Eric Wulff: it's basically to

      646
      14:04:39,340 --> 14:04:49,350
Eric Wulff: tune the hyperparameters of an AI model or a deep learning model; and hyperparameters are simply the model settings.

      647
      14:04:58,840 --> 14:05:09,139
Eric Wulff: They can define things like the model architecture, for instance how many layers you have in your neural network, how many nodes you have in each layer, and so on.

      648
      14:05:09,520 --> 14:05:19,239
Eric Wulff: But they also define things that have to do with the optimization of the model, such as the learning rate, the batch size, and so forth.

      649
      14:05:19,720 --> 14:05:20,570
      Yeah.

      650
      14:05:22,180 --> 14:05:28,950
Eric Wulff: Now, if you have a large model, or a very complex model, which requires a lot of compute

      651
      14:05:29,220 --> 14:05:30,469
      Eric Wulff: and

      652
      14:05:31,480 --> 14:05:33,510
Eric Wulff: to do the forward pass,

      653
      14:05:33,610 --> 14:05:34,950
      Eric Wulff: and

      654
      14:05:35,630 --> 14:05:38,329
Eric Wulff: and/or you have large datasets,

      655
      14:05:38,360 --> 14:05:41,660
Eric Wulff: hypertuning can be extremely

      656
      14:05:41,940 --> 14:05:56,630
Eric Wulff: compute- and resource-intensive, so it can benefit greatly from HPC resources. Furthermore, we need smart and efficient search algorithms to find good hyperparameters, so that we don't waste the HPC resources that we have.

      657
      14:05:59,290 --> 14:06:00,480
      Eric Wulff: um.

      658
      14:06:01,000 --> 14:06:10,500
Eric Wulff: So in RAISE I have been working with a group working on machine-learned particle flow, which is

      659
      14:06:10,810 --> 14:06:13,939
Eric Wulff: in collaboration with CMS,

      660
      14:06:14,080 --> 14:06:17,230
Eric Wulff: with people from CMS. And

      661
      14:06:17,420 --> 14:06:19,599
Eric Wulff: in order to hypertune this model,

      662
      14:06:19,690 --> 14:06:25,310
Eric Wulff: in RAISE we have been using an open-source framework called Ray Tune,

      663
      14:06:25,750 --> 14:06:34,059
Eric Wulff: which allows us to run many different trials in parallel, using multiple GPUs per trial,

      664
      14:06:34,270 --> 14:06:39,010
Eric Wulff: which is what this picture up here is trying to represent.

      665
      14:06:39,570 --> 14:06:40,990
      Eric Wulff: And

      666
      14:06:42,990 --> 14:06:51,389
Eric Wulff: Now, with Ray Tune we also get a very nice overview of the different trials, and we can pick the one that we see performs the best.

      667
      14:06:51,580 --> 14:06:57,289
Eric Wulff: And Ray Tune also has a lot of different search algorithms that

      668
      14:06:57,660 --> 14:07:01,359
Eric Wulff: help us find the right

      669
      14:07:01,690 --> 14:07:02,970
Eric Wulff: hyperparameters.

      670
      14:07:03,430 --> 14:07:18,949
Eric Wulff: And here, to the right, we have an example of the kind of difference this can make to the learning of the model. Here we have plotted the training and validation losses before and after hypertuning.

      671
      14:07:20,620 --> 14:07:32,120
Eric Wulff: As you can see here, the loss went down quite a bit after hypertuning, almost by a factor of two, and furthermore the training seems to be much more stable. We have

      672
      14:07:32,380 --> 14:07:36,559
Eric Wulff: these bands which represent the standard deviation

      673
      14:07:36,750 --> 14:07:42,170
Eric Wulff: between different trainings; it's much more stable in the right plot.

      674
      14:07:47,030 --> 14:07:56,090
Eric Wulff: And I just had one more slide here to illustrate how useful high-performance computing can be in order to speed up hyper-

      675
      14:07:56,810 --> 14:07:58,380
      parameter optimization.

      676
      14:07:58,560 --> 14:08:03,430
Eric Wulff: So this just shows the scaling from four to twenty-four

      677
      14:08:03,680 --> 14:08:05,309
Eric Wulff: compute nodes.

      678
      14:08:05,330 --> 14:08:06,550
      Eric Wulff: Um,

      679
      14:08:06,990 --> 14:08:15,439
Eric Wulff: Particularly looking at the plot to the right here, we can see that the scaling for this use case is actually better than linear,

      680
      14:08:15,570 --> 14:08:20,269
Eric Wulff: which at least in part has to do with

      681
      14:08:20,820 --> 14:08:26,109
Eric Wulff: some excessive reloading of models that happens when we have few nodes.
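
      A toy model of why few-node runs suffer, assuming (with invented numbers) a fixed amount of training work plus a reload penalty whenever there are more trials than nodes:

          # Fixed training work split across nodes, plus a reload penalty
          # whenever trials outnumber nodes and must be swapped in and out.
          WORK = 96.0                    # node-hours of pure training (invented)
          TRIALS, RELOAD_COST = 32, 0.5  # invented

          def walltime(nodes):
              swaps = max(TRIALS - nodes, 0)
              return (WORK + swaps * RELOAD_COST) / nodes

          base = walltime(4)
          for n in (4, 8, 12, 24):
              # Speedup relative to 4 nodes exceeds the node ratio,
              # i.e. apparently superlinear scaling.
              print(n, "nodes -> speedup vs 4 nodes:", round(base / walltime(n), 2))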

      682
      14:08:28,060 --> 14:08:29,150
      Eric Wulff: Um.

      683
      14:08:31,070 --> 14:08:35,830
Eric Wulff: Well, this basically means that the more

      684
      14:08:36,030 --> 14:08:41,099
Eric Wulff: nodes we have, the more GPUs we have, and the faster we can tune and deploy these models.

      685
      14:08:41,670 --> 14:08:47,480
Eric Wulff: That's all I had for this.

      686
      14:08:48,740 --> 14:08:58,029
Enrico Fermi Institute: Can you tell a priori that the model you're using will

      687
      14:08:58,080 --> 14:09:04,340
Enrico Fermi Institute: show this sort of behavior, so that if someone comes with any given model, you know how to sort of shape the work,

      688
      14:09:06,550 --> 14:09:15,609
Enrico Fermi Institute: if you understand what I mean. What I mean is, you discovered that you get better-than-linear scaling with this training,

      689
      14:09:15,700 --> 14:09:16,719
      Right?

      690
      14:09:17,160 --> 14:09:22,499
Enrico Fermi Institute: That's not always the case, or is that the case with any given model?

      691
      14:09:23,150 --> 14:09:24,459
      Um,

      692
      14:09:25,150 --> 14:09:33,199
Eric Wulff: Yeah, I think so. This is showing the scaling of the hyperparameter optimization itself.

      693
      14:09:33,650 --> 14:09:40,180
Eric Wulff: So if you had just a single training, it wouldn't scale like this; it would be

      694
      14:09:40,360 --> 14:09:42,610
Eric Wulff: a bit worse than linear, probably.

      695
      14:09:45,610 --> 14:09:51,289
Eric Wulff: So the way that the hypertuning works in this case is that we

      696
      14:09:51,430 --> 14:09:53,199
Eric Wulff: launch a bunch of

      697
      14:09:53,690 --> 14:09:56,980
Eric Wulff: trials in parallel with different hyperparameter

      698
      14:09:57,010 --> 14:09:58,559
      Eric Wulff: configurations.

      699
      14:09:58,990 --> 14:10:00,189
      Eric Wulff: And then

      700
      14:10:00,340 --> 14:10:01,780
      Eric Wulff: um!

      701
      14:10:02,230 --> 14:10:10,820
Eric Wulff: there is a sort of scheduling or search algorithm looking at how well all these trials perform,

      702
      14:10:10,940 --> 14:10:22,829
Eric Wulff: and then it terminates the ones that look less promising and continues training the ones that look promising. And then we can also have some kind of Bayesian optimization

      703
      14:10:23,190 --> 14:10:26,360
      Eric Wulff: component here, which tries to predict which

      704
      14:10:27,470 --> 14:10:31,230
Eric Wulff: hyperparameters would perform well, and then we try those next.
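
      The pattern being described, training all live trials to a common epoch, terminating the weak ones, and proposing new configurations informed by the history, can be sketched in plain Python; everything below (the toy loss, the rung epochs, the crude proposal rule) is invented for illustration:

          import random

          def run_epochs(config, stop):
              # Toy stand-in for training a trial up to epoch `stop`:
              # loss depends on distance from an unknown optimum lr of 1e-3.
              return (config["lr"] - 1e-3) ** 2 * 1e6 + 1.0 / stop

          def suggest(history):
              # Crude "Bayesian-ish" proposal: sample near the best config so far.
              if not history:
                  return {"lr": 10 ** random.uniform(-5, -1)}
              best = min(history, key=lambda h: h[1])[0]
              return {"lr": best["lr"] * 10 ** random.uniform(-0.5, 0.5)}

          history = []
          live = [suggest(history) for _ in range(8)]   # 8 trials, as if one per node
          for rung in (5, 10, 20):                      # epochs at which trials are compared
              scored = sorted(((c, run_epochs(c, rung)) for c in live), key=lambda h: h[1])
              history += scored
              keep = [c for c, _ in scored[: len(scored) // 2]]  # promising half survives
              live = keep + [suggest(history) for _ in range(len(scored) - len(keep))]
          print(min(history, key=lambda h: h[1]))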

      705
      14:10:32,930 --> 14:10:39,059
Enrico Fermi Institute: And if you were to double or triple the number of nodes, would that continue, or

      706
      14:10:39,310 --> 14:10:42,929
Enrico Fermi Institute: does the actual speedup begin to flatten out?

      707
      14:10:43,430 --> 14:11:00,910
Eric Wulff: I haven't tested this beyond twenty-four nodes, so I can't say for sure, but I imagine it will continue for at least a bit more. But I can't say for how long, and

      708
      14:11:01,060 --> 14:11:16,039
Enrico Fermi Institute: I also expect that eventually it would flatten off.

      709
      14:11:17,080 --> 14:11:18,540
      Eric Wulff: Um,

      710
      14:11:19,510 --> 14:11:23,909
Enrico Fermi Institute: Because of what, though? Is the issue resource contention?

      711
      14:11:24,600 --> 14:11:30,520
Eric Wulff: Yeah, it has to do with the search algorithm that

      712
      14:11:30,630 --> 14:11:32,309
      Eric Wulff: um

      713
      14:11:33,180 --> 14:11:39,990
Eric Wulff: trains a few trials and then terminates bad ones, and then continues with new ones. So

      714
      14:11:40,360 --> 14:11:48,789
Eric Wulff: if you have more trials than you have nodes to run them on, you have to sort of

      715
      14:11:49,280 --> 14:11:54,179
Eric Wulff: pause trials at some point and start training other ones,

      716
      14:11:54,590 --> 14:11:56,110
      Eric Wulff: Um!

      717
      14:11:56,270 --> 14:12:02,699
Eric Wulff: because you need to train all the trials up to the same epoch number before you decide which ones to keep and which not.

      718
      14:12:04,140 --> 14:12:11,450
Eric Wulff: So it doesn't have to do with Ray Tune per se; it just has to do with the particular search algorithm, or,

      719
      14:12:11,530 --> 14:12:15,219
Eric Wulff: well, a lot of search algorithms actually work like that.

      720
      14:12:18,070 --> 14:12:19,019
      Yeah,

      721
      14:12:19,250 --> 14:12:21,929
Enrico Fermi Institute: We have a question or comment from Ian in the chat.

      722
      14:12:22,100 --> 14:12:40,870
Ian Fisk: Yeah, I had a question for Eric, and maybe it's too early to tell. But my question was, how stable do you expect the hyperparameter tuning to be, in the sense of: are we expecting that every time we change the network or get new data, we're going to have to re-optimize the hyperparameters? Or is this something

      723
      14:12:40,880 --> 14:12:50,119
Ian Fisk: that once we optimize for a particular problem, we may find that those are stable over periods of time. The reason I ask is that this seems like,

      724
      14:12:50,620 --> 14:12:59,900
Ian Fisk: when we talk about the use of HPC or clouds and specialized resources, training is a big part of how we tend to use them. But the hyperparameter

      725
      14:13:00,190 --> 14:13:11,330
Ian Fisk: optimization sort of increases that by a factor of fifty or so. And so, if we have to do it each time, we probably need to factor those things into our thoughts about where we're constrained on resources.

      726
      14:13:12,110 --> 14:13:14,099
      Eric Wulff: Yeah, so

      727
      14:13:14,770 --> 14:13:16,039
      Eric Wulff: um,

      728
      14:13:16,760 --> 14:13:23,389
Eric Wulff: It would completely depend on how much you change your model, or how much you change the problem.

      729
      14:13:23,470 --> 14:13:24,989
      Eric Wulff: I mean, if you're

      730
      14:13:25,010 --> 14:13:27,139
      Eric Wulff: if you change your model

      731
      14:13:27,180 --> 14:13:32,739
Eric Wulff: architecture, you will probably have to run a new hyperparameter optimization,

      732
      14:13:32,770 --> 14:13:38,310
Eric Wulff: because you might not even have the same hyperparameters in your model anymore.

      733
      14:13:38,550 --> 14:13:40,150
      Eric Wulff: Uh,

      734
      14:13:40,610 --> 14:13:56,560
Eric Wulff: But you know, if things aren't too different, you might not have to hypertune, or maybe just do a small hypertuning, you know, just a few parameters in some narrow or small search space.

      735
      14:13:56,690 --> 14:13:58,640
      Eric Wulff: So, for instance,

      736
      14:13:59,020 --> 14:14:00,809
      Eric Wulff: if you look at other

      737
      14:14:00,840 --> 14:14:01,950
      Eric Wulff: uh

      738
      14:14:02,920 --> 14:14:06,280
Eric Wulff: other fields, such as, for instance,

      739
      14:14:06,390 --> 14:14:09,070
Eric Wulff: image recognition or object detection:

      740
      14:14:09,210 --> 14:14:26,879
Eric Wulff: if you find a network that performs well on classifying certain kinds of objects, then it's very likely that, using the same hyperparameters, it would be good at classifying other kinds of objects as well, if you just have labeled data for those objects.

      741
      14:14:26,890 --> 14:14:29,329
      So in that case, probably you wouldn't have to

      742
      14:14:31,100 --> 14:14:33,510
Eric Wulff: run a full hyperparameter optimization again.

      743
      14:14:37,260 --> 14:14:46,599
Ian Fisk: Thanks. It's very impressive, the amount that it improves the situation. Getting a factor of two is nice.

      744
      14:14:48,460 --> 14:14:49,360
      Eric Wulff: Thanks.

      745
      14:14:50,810 --> 14:15:07,050
Paolo Calafiura (he): A question or comment from Paolo? Yes. I missed the first couple of minutes, sorry, if the question was addressed there. So my question is: here you're starting to show the scaling at four nodes,

      746
      14:15:07,060 --> 14:15:13,339
Paolo Calafiura (he): and I wonder what the scaling would look like if you compared it with a single node, or a single GPU.

      747
      14:15:14,870 --> 14:15:16,540
      Eric Wulff: Um.

      748
      14:15:26,890 --> 14:15:32,669
Eric Wulff: The fewer nodes you have, the more of this excessive reloading has to happen.

      749
      14:15:32,930 --> 14:15:37,320
Eric Wulff: So just using one node would be very, very slow.

      750
      14:15:37,510 --> 14:15:50,440
Paolo Calafiura (he): But is that because of the way Ray Tune does this business, or is it because of the search algorithm we use? So it's not Ray Tune per se, it's the

      751
      14:15:51,360 --> 14:15:58,859
Eric Wulff: It's because of the algorithm. You wouldn't be able to run this faster with another framework. Well, I mean

      752
      14:15:59,760 --> 14:16:18,139
Paolo Calafiura (he): It's the algorithm's problem, not Ray Tune's. So it's a little bit harder then to do the comparison. I'm thinking of, if you used something like scikit-optimize on a single GPU to do the same thing. And then, of course, there is the question, what is the

      753
      14:16:22,910 --> 14:16:26,699
Paolo Calafiura (he): Okay, it's a complicated question.

      754
      14:16:29,870 --> 14:16:32,029
      Okay? Next we have

      755
      14:16:34,400 --> 14:16:45,700
Shigeki: Yeah, I'm gonna show my ignorance here, just trying to understand exactly how this works. I think I'm on the first slide, second slide.

      756
      14:16:45,730 --> 14:16:54,140
Shigeki: You show trial one, trial two, trial three, and those trials are independent of each other, right? They're all working on...

      757
      14:16:54,440 --> 14:17:12,849
Shigeki: Okay. The next thing here is that presumably they're reading the same set of data over in order to train, but they're completely independent in terms of where they are in the input stream, right? They're not working in lockstep or anything.

      758
      14:17:13,630 --> 14:17:25,690
Eric Wulff: It depends on the kind of search algorithm that you use, the hyperparameter search algorithm. So,

      759
      14:17:26,590 --> 14:17:27,650
      Eric Wulff: in um.

      760
      14:17:28,350 --> 14:17:40,270
Eric Wulff: well, to begin with, you can choose not to use any particular search algorithm, and then everything is just done in parallel, sort of.

      761
      14:17:40,560 --> 14:17:41,710
      Eric Wulff: however,

      762
      14:17:42,000 --> 14:17:53,250
Eric Wulff: it's much more efficient to use some kind of search algorithm. Then you would want to train all the trials up to a certain

      763
      14:17:53,570 --> 14:17:58,200
Eric Wulff: epoch number. Let's say you train them all up to epoch five, and then

      764
      14:17:58,530 --> 14:18:08,800
Eric Wulff: you have some algorithm that decides which ones to terminate and which ones to continue training, and in place of the ones you terminated, you start new trials

      765
      14:18:08,820 --> 14:18:12,450
Eric Wulff: with new hyperparameter configurations.

      766
      14:18:12,500 --> 14:18:19,529
Eric Wulff: So then, if you have many more trials than you have compute nodes, you have to

      767
      14:18:19,720 --> 14:18:27,839
Eric Wulff: pause some trials at epoch five, and then load in new trials and train them up until epoch five.

      768
      14:18:28,230 --> 14:18:30,749
      Shigeki: Okay. So

      769
      14:18:31,070 --> 14:18:35,280
Shigeki: Okay. But to a certain extent, though, the trials are running independently,

      770
      14:18:35,290 --> 14:18:51,889
Shigeki: and they get synchronized at some point by the epoch that you're stopping at. But other than that, up to that epoch point, they're blasting through the data as quickly as they can, and so they're not in sync. Okay,

      771
      14:18:52,640 --> 14:18:53,690
      Shigeki: thank you.

      772
      14:18:56,430 --> 14:18:59,330
      Enrico Fermi Institute: So how long does it take to run this on,

      773
      14:18:59,370 --> 14:19:07,800
Enrico Fermi Institute: you know, for one node? How long does the hyperparameter optimization run, in terms of wall time? Hours?

      774
      14:19:08,120 --> 14:19:09,599
      Eric Wulff: Um!

      775
      14:19:10,010 --> 14:19:11,059
      Eric Wulff: So

      776
      14:19:11,130 --> 14:19:21,010
Eric Wulff: that can vary a lot, depending on how large your search space is, and the model we use, and the data that we use, and so on. I think for the results I show here,

      777
      14:19:21,310 --> 14:19:22,860
      Eric Wulff: Um

      778
      14:19:23,820 --> 14:19:26,859
      Eric Wulff: uh, If I remember correctly,

      779
      14:19:27,120 --> 14:19:33,029
Eric Wulff: the whole thing took around eighty hours

      780
      14:19:33,190 --> 14:19:35,740
      Eric Wulff: in wall time,

      781
      14:19:35,980 --> 14:19:40,909
Eric Wulff: and that was using twelve

      782
      14:19:40,930 --> 14:19:45,800
Eric Wulff: compute nodes with four GPUs each.

      783
      14:19:45,810 --> 14:20:11,110
Enrico Fermi Institute: That can be trivially broken up into multiple jobs and things like that? The reason I ask is, one of the things I notice is that on some of the HPCs, at least in the US, they have maximum wall times for jobs in the queues, right? So I'm looking at Perlmutter right now, and it says you can have a GPU job in the regular queue for twelve hours at most.

      784
      14:20:11,120 --> 14:20:15,659
Enrico Fermi Institute: And so I'm wondering, what useful work can we get done, for

      785
      14:20:15,870 --> 14:20:25,280
Enrico Fermi Institute: hyperparameter optimization or machine learning in general, given the relatively short maximum wall time?

      786
      14:20:25,450 --> 14:20:29,280
Eric Wulff: So one solution is to

      787
      14:20:29,460 --> 14:20:31,290
Eric Wulff: checkpoint

      788
      14:20:31,950 --> 14:20:39,149
Eric Wulff: the search, and then just launch it again and continue where you left off. We're able to do that, so

      789
      14:20:39,190 --> 14:20:44,300
Eric Wulff: we are saving checkpoints regularly throughout the workload.
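
      A minimal sketch of the checkpoint-and-resubmit pattern being described; the file name, walltime budget, and stand-in training step below are invented:

          import os
          import pickle
          import time

          CKPT = "search_state.pkl"   # hypothetical checkpoint file
          BUDGET = 11.5 * 3600        # stay under a 12-hour queue limit, with margin

          # Resume from the checkpoint if a previous job left one behind.
          if os.path.exists(CKPT):
              with open(CKPT, "rb") as f:
                  state = pickle.load(f)
          else:
              state = {"epoch": 0, "weights": None}

          start = time.time()
          while state["epoch"] < 100:
              # Stand-in for one epoch of training / one round of the search.
              state["weights"] = f"weights-after-epoch-{state['epoch'] + 1}"
              state["epoch"] += 1
              with open(CKPT, "wb") as f:  # save once per epoch
                  pickle.dump(state, f)
              if time.time() - start > BUDGET:
                  break                    # resubmit the batch job to continue from here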

      790
      14:20:45,570 --> 14:20:47,679
      Eric Wulff: Okay? And uh, yeah,

      791
      14:20:47,820 --> 14:20:50,360
      Enrico Fermi Institute: how often do you save the checkpoints?

      792
      14:20:51,280 --> 14:21:07,169
Eric Wulff: That's configurable, but usually once per epoch, so once per read-through of the data set.

      793
      14:21:08,020 --> 14:21:15,920
Eric Wulff: That depends a lot also, but let's say around, well, between twelve and twenty-four hours.

      794
      14:21:17,110 --> 14:21:20,540
      Eric Wulff: But this completely depends on how much data you have. And uh,

      795
      14:21:21,140 --> 14:21:24,060
Eric Wulff: you know, the particular model you use.

      796
      14:21:24,530 --> 14:21:41,880
Enrico Fermi Institute: That's an epoch for the hyperparameter optimization itself, not just a single instance of the neural network?

      797
      14:21:42,740 --> 14:21:45,710
      twenty-four hours for a single,

      798
      14:21:46,740 --> 14:21:53,449
Eric Wulff: And that's, so, you know, we have quite a big data set, so that's

      799
      14:21:53,510 --> 14:22:00,430
Eric Wulff: why. But we're also using four A100 GPUs for that. So

      800
      14:22:00,820 --> 14:22:02,320
Eric Wulff: if you have

      801
      14:22:02,640 --> 14:22:05,420
Eric Wulff: older GPUs, that would take much longer.

      802
      14:22:08,980 --> 14:22:19,460
Enrico Fermi Institute: I guess what I'm wondering is, for the report, should we have some recommendation that the policies at these sites

      803
      14:22:20,140 --> 14:22:25,540
Enrico Fermi Institute: allow much longer GPU jobs to run, to do these sorts of tasks?

      804
      14:22:26,090 --> 14:22:29,069
      Eric Wulff: Well, my opinion is that it would be

      805
      14:22:29,720 --> 14:22:47,669
Enrico Fermi Institute: it would be convenient if we could. But you know it's not deal-breaking, because we can checkpoint this and just reload, right? But can you, though? You just said your epochs are twelve to twenty-four hours, and Lincoln just said

      806
      14:22:47,720 --> 14:22:57,990
Eric Wulff: twelve hours. Sorry, sorry, I misspoke here, so,

      807
      14:22:58,500 --> 14:23:13,459
Eric Wulff: apologies, it's a bit late over here. It takes twenty-four hours for a full training, not for one epoch.

      808
      14:23:13,470 --> 14:23:33,439
Enrico Fermi Institute: So we're not asking for a policy change, right? Just a behavioral change, with checkpointing. And you're saving at the end of each full training, or each epoch? Each epoch. So it's not as much. You have, like, two hundred epochs, is that right? You probably have it in the plot.

      809
      14:23:33,650 --> 14:23:37,789
Eric Wulff: Yeah, yeah, in the plot here. So,

      810
      14:23:38,030 --> 14:23:56,069
Eric Wulff: this is a plot from last year. Now we have a larger data set, and we train for about a hundred epochs, and that takes roughly twenty-four hours.

      811
      14:23:57,900 --> 14:23:59,820
      Enrico Fermi Institute: Okay, Um,

      812
      14:24:00,170 --> 14:24:13,310
Enrico Fermi Institute: Would adding more GPUs per node help you in terms of number of epochs? Or do you have enough data to get reasonable convergence, at least with this model, after one hundred?

      813
      14:24:21,110 --> 14:24:22,430
Eric Wulff: Actually, we

      814
      14:24:22,690 --> 14:24:27,659
Eric Wulff: just saw that if we scale up our model

      815
      14:24:27,690 --> 14:24:40,729
Eric Wulff: significantly, so make the model larger with many more parameters, we can easily improve the physics performance. We just tried that

      816
      14:24:41,300 --> 14:24:44,330
      Eric Wulff: this week,

      817
      14:24:44,660 --> 14:24:47,859
Eric Wulff: because we were curious, basically. However,

      818
      14:24:47,920 --> 14:24:49,790
Eric Wulff: that's sort of not a model that we could run

      819
      14:24:58,390 --> 14:25:02,050
      Eric Wulff: quickly enough in production, anyway.

      820
      14:25:02,590 --> 14:25:03,639
      Eric Wulff: Um,

      821
      14:25:06,150 --> 14:25:08,350
      Eric Wulff: but it sort of shows that the

      822
      14:25:08,440 --> 14:25:17,159
Eric Wulff: there is enough information in the data to do better; we just need to improve the model, or the training of the model, somehow.

      823
      14:25:20,160 --> 14:25:25,100
Enrico Fermi Institute: Okay. Shigeki, you have your hand raised.

      824
      14:25:25,830 --> 14:25:42,530
Shigeki: Yeah, I just have a question in terms of the amount of data you're going through and the model size, which I guess is measured in terms of number of parameters as well as hyperparameters. And whether or not there is a

      825
      14:25:42,540 --> 14:25:54,120
Shigeki: size that physics problems in HEP tend to gravitate to, or can it be all over the map in terms of model size and data set size and number of hyperparameters?

      826
      14:25:55,040 --> 14:25:56,179
      Eric Wulff: Um!

      827
      14:25:56,320 --> 14:26:00,129
Eric Wulff: So, the number of hyperparameters:

      828
      14:26:00,190 --> 14:26:07,620
Eric Wulff: that's a little bit arbitrary, depending on what you mean by hyperparameters. So if you

      829
      14:26:08,040 --> 14:26:10,180
      Eric Wulff: uh if you count

      830
      14:26:10,250 --> 14:26:11,389
      Eric Wulff: well,

      831
      14:26:11,430 --> 14:26:13,889
Eric Wulff: count, well, you can configure

      832
      14:26:14,040 --> 14:26:23,330
Eric Wulff: very many things with our model. So if you count all those hyperparameters, I don't know how many they are, but there are hundreds, and we don't tune all of them, because there are too many.

      833
      14:26:28,100 --> 14:26:33,720
Eric Wulff: The number of trainable parameters in the model is around one million,

      834
      14:26:34,130 --> 14:26:37,850
      Eric Wulff: so that's fairly small, if you

      835
      14:26:37,890 --> 14:26:39,450
Eric Wulff: compare it with

      836
      14:26:40,090 --> 14:26:46,880
      Eric Wulff: other sciences, like image recognition, or natural language processing, then this is really a small model.

      837
      14:26:47,030 --> 14:26:48,389
      Eric Wulff: Um!

      838
      14:26:48,570 --> 14:26:50,480
Eric Wulff: Now, we think that, well,

      839
      14:26:50,580 --> 14:26:52,679
Eric Wulff: I actually don't know

      840
      14:26:53,190 --> 14:26:57,809
Eric Wulff: the memory requirements that we would have to

      841
      14:26:57,850 --> 14:27:05,289
Eric Wulff: adhere to, if this would go into production at some point in the future. But I don't think we could go much larger,

      842
      14:27:05,410 --> 14:27:19,759
Eric Wulff: at least not without doing some kind of quantization, either quantization-aware training or post-training quantization, or perhaps pruning weights after training, or doing some other tricks like that.

      843
      14:27:19,990 --> 14:27:23,109
Eric Wulff: Data set size: so the

      844
      14:27:23,680 --> 14:27:26,389
      Eric Wulff: the one we are currently using.

      845
      14:27:30,540 --> 14:27:34,559
      Eric Wulff: I think it's around four hundred thousand events

      846
      14:27:35,000 --> 14:27:38,260
Eric Wulff: collision events of the different kinds.

      847
      14:27:40,140 --> 14:27:44,790
Shigeki: Do you have an approximate idea of how many actual gigabytes that is?

      848
      14:27:45,140 --> 14:27:46,559
      Eric Wulff: Um

      849
      14:27:47,210 --> 14:27:48,730
Shigeki: Order of...

      850
      14:27:49,250 --> 14:27:51,920
      Eric Wulff: is a few hundred gigabytes

      851
      14:27:52,100 --> 14:27:54,480
      Eric Wulff: less than a thousand,

      852
      14:27:55,530 --> 14:28:08,920
Shigeki: And presumably when you're running this, it's compute bound, not I/O bound, in terms of feeding the training data,

      853
      14:28:08,950 --> 14:28:11,229
      Shigeki: or it depends.

      854
      14:28:11,450 --> 14:28:18,439
Eric Wulff: No, I would say it's compute bound. Looking at the GPU utilization, it goes

      855
      14:28:18,590 --> 14:28:20,070
Eric Wulff: close to one hundred.

      856
      14:28:20,139 --> 14:28:22,229
      Shigeki: Mhm Okay, thanks.

      857
      14:28:22,559 --> 14:28:27,009
Enrico Fermi Institute: And do you know how much of the memory on the GPU you're using?

      858
      14:28:27,570 --> 14:28:30,279
Eric Wulff: Yes, we

      859
      14:28:30,400 --> 14:28:33,209
Eric Wulff: use all of it, basically.

      860
      14:28:34,049 --> 14:28:40,529
Enrico Fermi Institute: So then it would not help you to have centers that chop up these big GPUs.

      861
      14:28:41,969 --> 14:28:45,769
Eric Wulff: I don't think so. So, there is a problem

      862
      14:28:45,930 --> 14:28:57,160
Eric Wulff: with having too large batch sizes sometimes. Basically, in order to fill up the GPU, you increase the batch size that you use for training,

      863
      14:28:57,230 --> 14:28:58,449
      Eric Wulff: Um,

      864
      14:28:59,530 --> 14:29:05,829
Eric Wulff: and that means you can push more data through

      865
      14:29:05,850 --> 14:29:14,719
Eric Wulff: per time unit, but it doesn't necessarily mean you can do more optimization steps. So you might not

      866
      14:29:14,879 --> 14:29:17,020
      Eric Wulff: uh reach

      867
      14:29:17,160 --> 14:29:20,090
      Eric Wulff: the same accuracy quicker.

      868
      14:29:26,629 --> 14:29:38,190
Eric Wulff: It's not obvious that it's always the case that you can just throw more memory at it and it helps. Yeah, I was actually thinking of it the other way:

      869
      14:29:38,990 --> 14:29:45,470
Enrico Fermi Institute: we have a question in our data center of how much we should chop up the A100s using MIG,

      870
      14:29:47,480 --> 14:29:50,440
Enrico Fermi Institute: you know, give a person a whole

      871
      14:29:51,010 --> 14:29:54,830
Enrico Fermi Institute: eighty gigs, or split it up two ways or four ways

      872
      14:29:55,139 --> 14:30:03,550
Eric Wulff: for several users at the same time.

      873
      14:30:05,549 --> 14:30:06,580
      Enrico Fermi Institute: Thanks.

      874
      14:30:07,530 --> 14:30:09,519
      Enrico Fermi Institute: Show another comment:

      875
      14:30:12,860 --> 14:30:17,950
      Enrico Fermi Institute: Sorry I got to the

      876
      14:30:18,650 --> 14:30:27,329
Dirk: Yeah, I had a question, and it's not so much, I mean, Eric, if you know, you can answer, but it's more looking at the broader,

      877
      14:30:27,559 --> 14:30:38,899
Dirk: more broad impact of that, and follow-on, because this is interesting, and this is ongoing. But what's the next step? Have there been any discussions how

      878
      14:30:38,969 --> 14:30:41,610
      Dirk: to integrate this in like?

      879
      14:30:41,700 --> 14:30:58,269
Dirk: Eventually? You said it's improving particle flow, so eventually it should feed back into how we run the reconstruction, basically. And then the question comes: how would you actually deploy this? How often do you have to run it?

      880
      14:30:58,540 --> 14:31:19,770
Dirk: How long does it take? And how often do I have to renew it, basically, with new data, to check that the parameters are still okay? And it's not just a question about this specific thing; these are the larger questions. Maybe Lindsey, or I don't know if Mike might answer, if there have been any

      881
      14:31:19,780 --> 14:31:26,789
Dirk: discussions of that already, or if that's still to come after the initial R&D is done.

      882
      14:31:30,130 --> 14:31:33,150
      Eric Wulff: Well, I would say, if uh,

      883
      14:31:33,470 --> 14:31:36,980
      Eric Wulff: if we are able to prove, or

      884
      14:31:37,030 --> 14:31:38,920
      Eric Wulff: somehow show, that

      885
      14:31:39,020 --> 14:31:43,090
      Eric Wulff: this machine learned approach to particle flow works

      886
      14:31:43,170 --> 14:31:44,490
      Eric Wulff: uh

      887
      14:31:44,880 --> 14:31:52,579
Eric Wulff: as well but more efficiently, or even better than the

      888
      14:31:52,610 --> 14:31:54,660
Eric Wulff: methods that are used at the moment,

      889
      14:31:55,670 --> 14:31:59,449
Eric Wulff: then we sort of freeze that model and

      890
      14:31:59,690 --> 14:32:04,779
Eric Wulff: get it into production, and then we shouldn't need to redo any hyperparameter

      891
      14:32:04,820 --> 14:32:34,339
Dirk: optimization or anything like that. Then it's like having a finished algorithm that just runs. Yeah, but during data taking the detector changes all the time. So who knows if the training you did on 2022 data, or even Run 2 data, is still valid for your next set of data. Right, but we're not training on data, we're training on simulation. Okay, right. But I think when we talk about these kinds of problems, one of the things that needs to be studied

      892
      14:32:34,580 --> 14:32:44,590
Ian Fisk: is how stable these are, because it could be that we're incredibly lucky, and once you do the hyperparameter optimization, it's applicable to

      893
      14:32:45,180 --> 14:32:51,009
Ian Fisk: small changes in data. And one thing that I think we can see from Eric's plots is that

      894
      14:32:51,050 --> 14:33:01,189
Ian Fisk: it makes these things faster; they train faster and better after the optimization. And so if we are reasonably lucky, they'll actually save us resources.

      895
      14:33:02,360 --> 14:33:03,300
      Okay,

      896
      14:33:03,500 --> 14:33:08,860
Dirk: Okay. But it sounds like it's a discussion that's still to come; we're not quite there yet.

      897
      14:33:09,400 --> 14:33:25,109
Ian Fisk: Well, I think so. Given how much this improves the situation, and I think this applies to multiple science fields, not just ourselves, chances are we should be factoring these things into our discussion about how we're going to use HPC

      898
      14:33:25,140 --> 14:33:35,829
Ian Fisk: for the report. And then we'll have to wait and see whether this is a workflow that we're constantly running, or one that we run once in a while.

      899
      14:33:39,190 --> 14:33:47,179
Mike Hildreth: Yeah, I guess I would agree with that. We don't have enough data

      900
      14:33:47,840 --> 14:33:53,670
Mike Hildreth: on how often we're going to have to retrain these. But this use case is certainly in the planning.

      901
      14:33:54,080 --> 14:33:55,760
      Enrico Fermi Institute: Is it right?

      902
      14:33:55,850 --> 14:34:07,809
Enrico Fermi Institute: I think the one remaining worry is, we haven't been through a complete recalibration cycle of the detector, after a stop or anything like that, to see

      903
      14:34:07,820 --> 14:34:21,400
Enrico Fermi Institute: how robust a single training, or the most optimal training, is with respect to the changing parameters of the detector. It's just something we have to find out, but it's not going to change the pattern all that much, to be honest.

      904
      14:34:21,410 --> 14:34:28,360
Enrico Fermi Institute: But yeah, I agree with Ian here: this is probably going to save us resources as well in the long run.

      905
      14:34:28,620 --> 14:34:30,320
      Dirk: Okay, thanks.

      906
      14:34:30,510 --> 14:34:38,550
      Dirk: That makes it difficult for us to write because we can write the use case in, but it's extremely hard to attach any numbers to it at the moment.

      907
      14:34:41,470 --> 14:34:55,099
Enrico Fermi Institute: Yeah, I guess another way to summarize it: we've shown that this works, and that we can get really great results out of it, but we haven't understood the true steady-state operational parameters of this system.

      908
      14:34:59,230 --> 14:35:04,370
Eric Wulff: And just to be clear, there still needs to be

      909
      14:35:04,610 --> 14:35:08,699
      Eric Wulff: quite a bit of work before this would be ready to go into production.

      910
      14:35:09,140 --> 14:35:10,600
      Eric Wulff: It's still

      911
      14:35:10,880 --> 14:35:14,050
Eric Wulff: like, we don't understand

      912
      14:35:14,200 --> 14:35:18,509
Eric Wulff: all the properties of how it reconstructs particles well enough yet,

      913
      14:35:20,650 --> 14:35:23,980
Eric Wulff: although, you know, it's looking good, it's looking promising,

      914
      14:35:24,230 --> 14:35:30,350
Eric Wulff: but it needs to be validated much more before production.

      915
      14:35:41,060 --> 14:35:44,129
Enrico Fermi Institute: Do we have more questions for Eric?

      916
      14:35:46,660 --> 14:35:50,649
      Enrico Fermi Institute: I guess one silly question

      917
      14:35:51,140 --> 14:36:03,900
Enrico Fermi Institute: In terms of actually trying to use this, like in CMSSW, and this is mostly because I don't remember from the last time that Joosep presented this: how fast does this go per event in inference mode?

      918
      14:36:04,220 --> 14:36:06,810
Enrico Fermi Institute: What does the throughput look like?

      919
      14:36:06,940 --> 14:36:24,380
Eric Wulff: I don't think we have done anything there that would be comparable to production. Or maybe an even better question is, what does the memory footprint look like on GPU or CPU?

      920
      14:36:24,770 --> 14:36:31,000
Eric Wulff: I don't know that off the top of my head, but I know we have a plot somewhere that I can

      921
      14:36:31,100 --> 14:36:32,899
      Enrico Fermi Institute: all good. Thank you.

      922
      14:36:37,540 --> 14:36:46,069
Enrico Fermi Institute: Okay, there are no other questions, so we can move on. Thank you very much for the presentation, Eric.

      923
      14:36:46,360 --> 14:36:48,119
      Eric Wulff: No problem. Thanks for listening.

      924
      14:36:51,760 --> 14:37:05,870
Enrico Fermi Institute: Okay, having some minor networking challenges here with the presentation.

      925
      14:37:06,270 --> 14:37:12,320
Enrico Fermi Institute: I think the room is still okay with sharing; it's just everybody's laptop that's connected on its own as a...

      926
      14:37:12,570 --> 14:37:27,260
Enrico Fermi Institute: Yeah, the wired connection is fine. Dirk, you wanted to talk about some more machine-learning-related topics while we were on this. Did you have other particular things to bring up? And then we bounce back to the network and stuff like that.

      927
      14:37:27,780 --> 14:37:43,740
Dirk: I think we can follow the regular agenda. We put the machine learning early because of some time constraints, but maybe we do the rest of the R&D as part of the normal agenda.

      928
      14:37:43,780 --> 14:37:45,420
      Enrico Fermi Institute: Okay.

      929
      14:37:45,550 --> 14:37:49,740
Dirk: So, figure out how to use this. So we were at impact, right?

      930
      14:37:49,770 --> 14:37:50,810
      Enrico Fermi Institute: Yeah,

      931
      14:37:50,880 --> 14:37:56,380
Dirk: I can say a little bit about that. And I think we discussed some of that yesterday already,

      932
      14:37:56,630 --> 14:38:01,189
Dirk: but it's also including cloud now, so we're looking at both.

      933
      14:38:01,450 --> 14:38:19,450
Dirk: So, what happens if we actually start using a lot of HPC and cloud resources, how does the integration work? I mean, at the moment we run them opportunistically, so they are considered an add-on. But what if we ever get to a point where they're a large fraction of our overall resources?

      934
      14:38:19,660 --> 14:38:37,420
Dirk: What's the impact on our global computing infrastructure? And how does it impact the owned resources that are still in the mix? You basically would have a lot of compute external to our own resources in some way.

      935
      14:38:37,470 --> 14:38:43,410
Dirk: And then you look at: what does that mean for our own sites? What kind of changes

      936
      14:38:44,440 --> 14:39:02,069
Dirk: might potentially be needed there to facilitate large-scale cloud use. To a large degree that will depend on how much we are actually using the storage at the cloud or the HPC. So if you consider that you don't have any storage, and you have to stream, or find some other way to get the data

      937
      14:39:02,110 --> 14:39:09,129
Dirk: in and out quickly and just process it on demand, that puts more pressure on our own sites, versus,

      938
      14:39:09,150 --> 14:39:22,100
Dirk: if you look at what ATLAS does, they have a self-contained site. That follows more the model of just bringing up another site somewhere else on some external resources, but it's kind of mostly self-contained.

      939
      14:39:22,530 --> 14:39:23,539
      Dirk: Um!

      940
      14:39:23,700 --> 14:39:42,489
Dirk: The other impact is that if we decide tomorrow, for instance, that our code performs great on ARM, and we should switch to it as much as possible because it's more cost effective, you can actually do that much quicker on the cloud. For instance, for that Google site, in principle

      941
      14:39:43,130 --> 14:39:59,530
Dirk: ATLAS could decide tomorrow that, oh, from now on we're provisioning ARM CPUs and not Intel CPUs anymore, because you just change the instance type. You can't do that on our own resources; that's a much longer process of multiple years to swap out resources.

      942
      14:39:59,560 --> 14:40:01,730
      Dirk: And uh, yeah,

      943
      14:40:01,830 --> 14:40:21,730
Dirk: And the other obvious issue is, even if we get storage at the cloud or HPC sites, you have to worry about transfers, because all these resources need to be integrated in our transfer infrastructure. We need to have Rucio be able to connect somehow, maybe

      944
      14:40:22,480 --> 14:40:23,449
      Dirk: have

      945
      14:40:24,520 --> 14:40:35,929
Dirk: intermediary node services. I know BNL has a Globus Online endpoint that ATLAS uses to facilitate transfers to some HPCs, and things like that. So

      946
      14:40:36,540 --> 14:40:52,250
Dirk: that feeds directly into the last point, network integration. So it's not just the transfer services, but also the underlying transfer fabric, the network connectivity of the cloud and HPC resources.

      947
      14:40:57,530 --> 14:41:07,960
Dirk: As I said, we discussed some of it yesterday already, and the one comment was that we should break out hardware and service costs, basically.

      948
      14:41:08,930 --> 14:41:11,800
Dirk: So, anything else, any other comments on this?

      949
      14:41:17,770 --> 14:41:20,830
Enrico Fermi Institute: One of the things that we had talked about in our

      950
      14:41:21,430 --> 14:41:33,769
Enrico Fermi Institute: discussions among the blueprint group, before the workshop here, was: is there any impact

      951
      14:41:34,040 --> 14:41:46,960
Enrico Fermi Institute: on grid sites if we were to do something like shift large amounts of certain kinds of workflows to cloud? Like, if we did a lot

      952
      14:41:46,980 --> 14:41:53,320
Enrico Fermi Institute: more simulation on HPC, would we have to,

      953
      14:41:53,540 --> 14:42:01,009
Enrico Fermi Institute: with the Tier-2s, run correspondingly more analysis or something like that? If that were the case, would they have to

      954
      14:42:01,330 --> 14:42:04,189
Enrico Fermi Institute: beef up their facilities in certain ways?

      955
      14:42:05,200 --> 14:42:12,550
Enrico Fermi Institute: Or does that not make sense at all; should we just anticipate that we'll be able to run all workload types on all resources,

      956
      14:42:13,990 --> 14:42:15,060
      things like that?

      957
      14:42:16,390 --> 14:42:19,609
Enrico Fermi Institute: I see there's a hand raised from Eric.

      958
      14:42:34,640 --> 14:42:37,089
      Eric Lancon: to export um

      959
      14:42:37,220 --> 14:42:39,349
      Eric Lancon: the Cpu processing

      960
      14:43:17,160 --> 14:43:18,999
      Eric Lancon: at the same site.

      961
      14:43:23,900 --> 14:43:40,379
Dirk: Yeah, that's something we worried about, the impact on the data transfers for Fermilab specifically. Because if you look at how we designed HEPCloud, we basically treat the HPC as an external compute resource, and then most of

      962
      14:43:40,540 --> 14:43:53,389
Dirk: the I/O and the data actually goes through Fermilab. So far everything is holding up nicely, but eventually, as we scale up HPC use, there's probably going to be an impact

      963
      14:43:53,480 --> 14:43:58,259
Dirk: on provisioning of network and storage at Fermilab.

      964
      14:44:22,190 --> 14:44:23,250
      Um.

      965
      14:44:23,340 --> 14:44:25,430
      Enrico Fermi Institute: Other comments on

      966
      14:44:25,660 --> 14:44:30,349
Enrico Fermi Institute: the impact of HPC and cloud use on the existing infrastructure?

      967
      14:44:37,250 --> 14:44:39,300
Steven Timm: I'll just say one thing I heard that

      968
      14:44:39,420 --> 14:44:42,249
      Steven Timm: you might not think about.

      969
      14:44:42,480 --> 14:44:43,560
      Steven Timm: Uh.

      970
      14:44:43,730 --> 14:44:48,970
Steven Timm: This was not a CMS thing, but we were running a

      971
      14:44:49,080 --> 14:44:58,119
Steven Timm: GPU inference server with Google Cloud, and we managed to saturate the network link, for a short time, between us and Google.

      972
      14:44:59,330 --> 14:45:02,110
      Steven Timm: So uh, you can.

      973
      14:45:02,280 --> 14:45:06,529
Steven Timm: If you're doing inference, you have to be careful of your network usage.

      974
      14:45:17,080 --> 14:45:29,849
Enrico Fermi Institute: I have what is possibly a profoundly uninformed question: how much of our Monte Carlo generation, at the actual generator level, is being

      975
      14:45:29,860 --> 14:45:38,420
Enrico Fermi Institute: done, or taking place, on GPUs, like using GPUs to do the Monte Carlo integration? And I'm waiting,

      976
      14:45:40,460 --> 14:45:56,779
Enrico Fermi Institute: because that is a significant fraction of the time that we spend right now. I mean, is that, for ATLAS and CMS, zero? Because a very quick search on the Internet informs us that

      977
      14:45:56,790 --> 14:46:15,759
Enrico Fermi Institute: GPU Monte Carlo integration has been around for more than ten years now, and the speedup for that integration is like a factor of fifty or something. Though of course this probably depends on the shape of the thing that you're integrating, and how many poles it has and whatnot.

      978
      14:46:15,870 --> 14:46:24,899
Enrico Fermi Institute: But has anyone looked at benchmarking that? And could it have a major impact if we could significantly reduce

      979
      14:46:25,020 --> 14:46:28,389
      Enrico Fermi Institute: the time to integrating

      980
      14:46:28,420 --> 14:46:37,380
Enrico Fermi Institute: the time to getting an integrated cross-section, and then also the time to unweighting the necessary amount of events?

      981
      14:46:37,460 --> 14:46:48,019
Enrico Fermi Institute: And could that fit on the HPC resources better? Could we use that in any way? I'm not sure; after that this gets really open-ended, but it seems like it's something we're not considering,

      982
      14:46:48,150 --> 14:46:54,739
Enrico Fermi Institute: because it would be a really nice way to hide a lot of the latency in our production workloads right now,

      983
      14:46:55,120 --> 14:46:57,210
Enrico Fermi Institute: or get rid of it, not even hide it.

      984
      14:47:00,660 --> 14:47:13,370
Enrico Fermi Institute: Yeah, this was a really open-ended question. But have we looked at that? And if we're not doing it now, after ten years, there must be something wrong,

      985
      14:47:13,420 --> 14:47:20,730
Dirk: Maybe, Lindsey, you and Mike would be in the best position to answer that question, in terms of

      986
      14:47:21,190 --> 14:47:25,329
      Enrico Fermi Institute: for something that is that old.

      987
      14:47:25,360 --> 14:47:31,520
      Enrico Fermi Institute: There's either something wrong with it, or we've actually just not been paying attention to it for a decade.

      988
      14:47:31,530 --> 14:47:46,870
Enrico Fermi Institute: Yeah, I personally don't have any information on that. Mike, do you have anything? I think the answer is zero as well, you know. So why aren't we using this? That's kind of a weird one.

      989
      14:47:47,170 --> 14:47:49,719
Steven Timm: There have been studies recently that

      990
      14:47:49,890 --> 14:48:01,270
Steven Timm: the dominant part of generation is actually throwing the dice and rolling random numbers. I don't know if that's true for CMS, but I know it's true for DUNE. I mean, could you envision a situation where you're

      991
      14:48:01,280 --> 14:48:16,659
Enrico Fermi Institute: just generating random numbers on a GPU and nothing else? Yeah, I mean, that's probably what a large portion of it is, they're throwing lots of random numbers in parallel. There are very good RNGs for GPUs.
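
      A minimal sketch of the pattern being discussed, a vectorized Monte Carlo integration with unweighting; the integrand and sample count below are invented, and swapping numpy for cupy would move the same lines onto a GPU:

          import numpy as np

          rng = np.random.default_rng(42)

          def integrand(x):
              # Toy stand-in for a matrix element with one sharp peak ("pole").
              return 1.0 / ((x - 0.5) ** 2 + 1e-3)

          # Throw millions of random points at once and average the weights.
          n = 10_000_000
          x = rng.random(n)
          w = integrand(x)
          print(f"I ~ {w.mean():.2f} +- {w.std() / np.sqrt(n):.4f}")

          # Unweighting: accept each point with probability w / w_max to turn
          # weighted samples into unit-weight events, as generators do.
          accepted = rng.random(n) * w.max() < w
          print(f"unweighting efficiency: {accepted.mean():.2%}")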

      992
      14:48:17,070 --> 14:48:35,200
Dirk: I think the question also goes a little bit out of scope, because we're not supposed to look into what's going on on the framework side and the software side. But from the conversation I had with Matti about where we had effort to spend in terms of GPU porting,

      993
      14:48:35,210 --> 14:48:39,789
Dirk: I think the simple answer is, we looked at the full chain:

      994
      14:48:40,240 --> 14:48:51,369
Dirk: GEN, SIM, DIGI, RECO, plus whatever miscellaneous comes after, and then decided that generation is not the primary target of

      995
      14:48:51,730 --> 14:49:02,819
Dirk: a porting effort, because it's not overall that important for us; it's less important than reconstruction and tracking. I mean, it's just a matter of lowest-hanging fruit,

      996
      14:49:03,210 --> 14:49:16,139
Dirk: and the picture changes, of course, from generator to generator. But I think that's the simple answer: effort focused on certain areas, and that's one of the ones that wasn't focused on.

      997
      14:49:16,150 --> 14:49:34,669
Enrico Fermi Institute: Yeah, I can see that; that's a reasonable answer, I guess. Looking at the shape of the compute facilities that we are getting from HPC, packaging up some huge job that you send out to an HPC, and then wait a long time and get your answer back, it seems,

      998
      14:49:34,680 --> 14:49:46,849
Enrico Fermi Institute: at least in terms of the geometry or the topology of the work, that makes a lot more sense for the kind of resources we're talking about. But I understand that RECO is certainly a higher priority in terms of compute.

      999
      14:49:49,940 --> 14:49:53,110
Enrico Fermi Institute: That's sort of where my thinking is heading, that's all.

      1000
      14:49:57,580 --> 14:49:59,379
      Enrico Fermi Institute: Steve. Did you have another comment?

      1001
      14:50:00,440 --> 14:50:01,420
      Steven Timm: No

      1002
      14:50:09,860 --> 14:50:13,999
Enrico Fermi Institute: Other comments here? Or should we move on to network integration?

      1003
      14:50:21,620 --> 14:50:24,489
Enrico Fermi Institute: Okay, sounds like we should move on.

      1004
      14:50:24,870 --> 14:50:25,860
      um

      1005
      14:50:26,190 --> 14:50:27,260
      Enrico Fermi Institute: screen.

      1006
      14:50:30,210 --> 14:50:51,389
Enrico Fermi Institute: So yeah, one of the things we wanted to talk about was just how our Tier-1s and Tier-2s are connected today, and sort of what the plans are for that in the future, some of the forward-looking stuff. And then we'll also have a presentation from Dale Carder of ESnet

      1007
      14:50:51,400 --> 14:50:59,409
Enrico Fermi Institute: to give us some of his thoughts as well. Yeah, one of the questions that comes up here is,

      1008
      14:51:00,220 --> 14:51:08,870
Enrico Fermi Institute: especially with the clouds, what can we do about connecting things to LHCONE? And in all this business,

      1009
      14:51:08,990 --> 14:51:14,459
Enrico Fermi Institute: people like to talk about egress costs; is there any quick and easy thing we can do to reduce those?

      1010
      14:51:14,770 --> 14:51:15,980
      Enrico Fermi Institute: um.

      1011
      14:51:16,900 --> 14:51:19,919
Enrico Fermi Institute: So, for site connectivity:

      1012
      14:51:19,990 --> 14:51:32,529
Enrico Fermi Institute: for CMS, one hundred gigabit at all the Tier-2 sites, and higher capacity to Fermilab. For the evolution of the US-based site connectivity, there are plans to demonstrate

      1013
      14:51:32,690 --> 14:51:46,870
Enrico Fermi Institute: over one hundred gigabit transfers in 2023, and plans to have Tier-2s at four hundred gigabits in 2025. Fermilab has plans for upgrades, but they're taking sort of a year-by-year approach. I don't know, Dirk, if you want to add anything else to that.

      1014
      14:51:48,140 --> 14:51:59,569
Dirk: No, that's basically it. All these plans are kind of tentative. We know we have to upgrade to get to HL-LHC, and it's going to be a process, but the exact schedule is a bit

      1015
      14:51:59,660 --> 14:52:02,299
Dirk: undefined at the moment.

      1016
      14:52:02,330 --> 14:52:05,130
      Enrico Fermi Institute: and I should say that a lot of these

      1017
      14:52:05,250 --> 14:52:07,310
      Enrico Fermi Institute: plans were

      1018
      14:52:07,500 --> 14:52:09,719
Enrico Fermi Institute: it's not said on the slide, but:

      1019
      14:52:10,130 --> 14:52:29,889
Enrico Fermi Institute: the plans were developed before the slip of the LHC schedule. So we're already talking about maybe pushing the demonstration of greater-than-one-hundred-gigabit transfers to 2024. Now that we have a couple more years, we're probably going to shift things back a bit.

      1020
      14:52:32,770 --> 14:52:45,550
Enrico Fermi Institute: On the ATLAS side: the slide says a few, but really most of the Tier-2s are basically at or near one hundred gigabits, some somewhat more than

      1021
      14:52:45,560 --> 14:52:59,419
Enrico Fermi Institute: a hundred, at two by one hundred, things like that. The Tier-1, as I understand, has at least four by one hundred gigabit. If I'm misrepresenting any of the sites, just jump in and correct me. And yeah,

      1022
      14:52:59,430 --> 14:53:15,340
Enrico Fermi Institute: our expectation is that in the future we'll have multiple hundreds of gigabits of connectivity; one or more sites may have four-hundred-gigabit links. I think a lot of it depends on the economics of when it's sensible to start buying four hundred gig.

      1023
      14:53:17,980 --> 14:53:24,780
Enrico Fermi Institute: Yes, that's the plan. So I think now we can jump to Dale's presentation, if you're out there, Dale.

      1024
      14:53:25,670 --> 14:53:31,499
Enrico Fermi Institute: Yes, sounds good. Okay, great. I'm going to stop sharing here and you can start your share.

      1025
      14:53:35,730 --> 14:53:49,210
Dale Carder: All righty. Thanks for having me here today, and feel free to interrupt; I like this interactive approach a lot more than me just preaching. So I've kind of got an overview of

      1026
      14:53:49,220 --> 14:53:56,039
Dale Carder: sort of the DOE networking perspective on HPC facilities, the Tier-1,

      1027
      14:53:56,070 --> 14:53:59,089
      Dale Carder: and then we'll get into some cloud stuff, and then

      1028
      14:53:59,430 --> 14:54:05,269
      Dale Carder: then I sort of trail off into where I have more questions than answers, which is, I guess, not surprising,

      1029
      14:54:05,460 --> 14:54:08,239
      Dale Carder: given What? Where some of these conversations have been.

[Dale Carder] 14:54:08
So the biggest thing I want to emphasize, with respect to not just where we are now but the timeline between now and the beginning of High-Luminosity LHC: we had this big process to build ESnet6, and some of the key components included building our physical network into each DOE national lab. That means our fiber extends in there, with our equipment collocated at the site, with routers that we run there, so we can offer essentially any ESnet service at any national lab at full scale.

[Dale Carder] 14:54:49
Also, ESnet now owns the optical equipment and basically the end-to-end connectivity, so it's extremely cost-effective to upgrade. It's not going out and procuring circuits from vendors, things along that line; we're doing all of our optical engineering in house now, so we can go out and buy modems from any vendor off the shelf and put them onto our network after we qualify them. So it's a very different evolution model from the traditional backbone approach of buying circuits and linking things together hop by hop.

[Dale Carder] 14:55:27
There was already a little bit shown of where we're at, connectivity-wise, for each of the LCFs and NERSC. Basically we're right now at this precipice of going from n-by-one-hundred-gig connectivity to 400-gig-class connectivity. Everyone's got a slightly different timeline, in large part due to equipment shortages and things of that sort, but generally across the big DOE facilities this is all kind of happening in parallel. Yesterday there was a lot of talk about NERSC being sort of different from the LCFs, which is fair; they're targeting one terabit per second basically into their facility, and that's not through the lab, that's direct to NERSC.

[Dale Carder] 14:56:18
Where I think this puts us, and where at least I want to be, is that the limiting factor is going to be at the site. If we can basically show up to the door of Fermilab, or the door of NERSC, or wherever, with essentially all-you-can-eat connectivity, then it's now on to the border router, the security junk, the data transfer nodes, and storage; that's where the scaling factors are going to be, not necessarily the wide area. So that's sort of where I think we're going to be, at least in the next couple of years. We've got a long life cycle, especially on the optical network that we've built. Are there any questions on this front, before we drift off into cloud stuff?

[Enrico Fermi Institute] 14:57:10
So when is the 400 gigabit stuff expected to become economical?

[Dale Carder] 14:57:20
'Economical' is a funny term, but it's almost more about availability right now: can you buy equipment or not? In some cases you can actually only buy the newer equipment, because it's the smaller fab sizes that are actually being produced, versus the larger fabs where you're competing with chips for dishwashers and things like that. So it's sort of this funny point. But in our conversations, I think we're up to like sixty or seventy percent of the tier-2s; nearly everyone has a plan for the next couple of years, either like next year or right after that. So we're pretty much right at that point now. A lot of that's driven by the economics of these major cloud data centers: if you can buy equipment matching what the industry as a whole is buying, you're going to reap the rewards of that cost-effectiveness.

[Enrico Fermi Institute] 14:58:25
Is there a concern that, and I know it doesn't apply for a lot of sites, but for things like firewalls; I know some sites are more concerned about that than others. Are the firewall appliances sort of keeping pace with the...

[Dale Carder] 14:58:41
I'll say no. I don't think there's truly been a demonstrated track record of that. You know, we still see traffic compound at, what, forty-ish percent annually. Those firewalls and middleboxes are designed for typically administrative workloads, where at the end of the day there's only so much data, where all you guys sitting on your laptops in the conference room are going to be competing for resources. That's very different from scientific computing.
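
[Editor's note: to make the compounding figure concrete, here is a minimal Python sketch of the arithmetic. The ~40% annual rate is the only input taken from the discussion; the six-year horizon is an arbitrary illustration.]

    import math

    annual_growth = 0.40  # ~40% compound annual traffic growth, the figure quoted above

    # Years for traffic to double at this rate: solve (1 + r)^t = 2
    doubling_years = math.log(2) / math.log(1 + annual_growth)
    print(f"Traffic doubles roughly every {doubling_years:.1f} years")

    # Growth factor over a six-year span (illustrative horizon only)
    years = 6
    print(f"{years}-year growth factor: {(1 + annual_growth) ** years:.1f}x")

[At 40% per year, traffic doubles roughly every two years, which is the scaling any perimeter appliance would have to keep up with.]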

[Dale Carder] 14:59:22
So there's things ESnet has worked on in that vein, such as the Science DMZ model: how to place resources at a site, how to change the perimeter architecture to better accommodate data-intensive sciences. So there's opportunities there. But I still don't see a world where you could cost-effectively deploy an off-the-shelf firewall middlebox.

[Dale Carder] 14:59:52
I'd love to be proven wrong, so please do.

[Dirk] 14:59:52
Yeah, I had a comment on the last line on this slide, where you said 'wide disparity in HPC support for data-centric workflows'. We discussed that a lot yesterday, and where I was curious was whether this actually has an impact on how these HPC facilities approach building up their external connectivity, or if that doesn't matter and they're still going for full connectivity to the data transfer nodes at least, even if they don't have something like NERSC, where they want to support...

[Dale Carder] 15:00:41
Right. So it's helpful for me to think about this in terms of procurement life cycles, because I think the LCFs are also very much in that world, where you go out and survey the user community for needs, come up with a list of use cases that you're going to support, and you take that to DOE as part of CD-0 and say: here's the mission need of what we do. You go into alternatives analysis and so on, and then five years later a machine shows up on the dock and gets installed. So it's really about being ahead of that, and ESnet is in the exact same boat. When we built ESnet6, that's exactly the process we went through, beginning five, six years ago, and here we are; we're going to have our official grand unveiling next month.

[Dale Carder] 15:01:33
So on the ESnet side of the world we've had these requirements reviews. Many of you here participated in the requirements review for HEP; we're currently doing one now for Basic Energy Sciences, and this goes directly into our longer-term procurement forecasts, budgets, and things of that nature, so that we don't overbuild and spend a lot of taxpayer resources way too early, nor get caught on the other end, far behind where the needs lie. We essentially solve this on our end through constant communication, and beating people up, like Andrew Melo, for status as to what's going on, and making sure that we're in lockstep.

[Dale Carder] 15:02:25
So for NERSC, like I said yesterday, we're doing a requirements review for Basic Energy Sciences, and in there will be a case study for LCLS-II, I think, and how that operation at SLAC is going to be integrated with NERSC, because they're talking, again, terabit-scale workflows from the beam line to compute, and then autonomous steering back. So there's things there that could be of relevance to this group, to see how other groups are handling it.

[Enrico Fermi Institute] 15:03:07
All right, do you have one more? I'll do one more, and if this is covered in another slide, feel free to defer. But what's the ESnet thinking on caching in the network?

[Dale Carder] 15:03:22
Yeah, I'll have just a bullet on that; we can kind of open it up there as I get into the more...

[Dale Carder] 15:03:31
Yeah, so let's talk about clouds. The terminology around cloud stuff is amazingly hard to comprehend, because every vendor has their own proprietary language, and they'll use the same words, and none of them are actually descriptive of what's going on. But let's lump it into two bins: public cloud and private cloud.

[Dale Carder] 15:03:56
Public cloud is what happens when you just log into an EC2 console and fire up a VM; you're going to get a network that's essentially meant to be public-facing, and those egress charges we keep hearing about apply, and things of that nature.

[Dale Carder] 15:04:17
Private cloud is where you would be standing up multitudes of instances of compute with some private back-end network, and then that private back-end network has some sort of egress delivered through a multitude of means. But it has to connect to something, right? It's fully self-contained, so you have to either connect back to your home institution or use some tunneling technology; optionally, you can bring your own IP addressing. The typical workloads are administrative computing. Say the University of Chicago wanted to put the HR system in the cloud and keep it on the University of Chicago network, since it's HR data; this is the technology they would use. And, I should put this in bold: it's very expensive, and we're talking about data rates commensurate with administrative computing, not research computing.

[Dale Carder] 15:05:18
And that's why you see software routers, software appliances doing these VPNs. So the vendors have come up with, in addition to multiple ways to extract money from you, different ways to work around these limitations, if you're beyond the scale of what you can get away with using a software-based router and software-based VPNing of traffic back to an institution. There's dedicated interconnects; these are essentially charged-by-the-hour connections. That's why I tried to rate this like going to a restaurant: this is the four-dollar-sign menu option. You have Cloud Exchange, where you'd have an intermediate broker managing the physical infrastructure for you; we have some of these today on ESnet, and we're working to deprecate them, because they're at the three-dollar-sign level. And we're replacing them with this partner interconnection model, where you go out and procure, and by 'you' I mean ESnet goes out and procures, a middleman to handle the interconnection and get away from the hourly port charges to the various entities, and throw some virtualization on top of that, and come out at only the two-dollar-sign level.

[Dale Carder] 15:06:40
But again, these are still humble data rates. And to put actual money on here: it's nearly impossible to figure out what these things cost. You need a used-car salesman to help you figure it out.

[Dale Carder] 15:06:55
So putting that into where we are today, connectivity-wise: in the public cloud realm, if you were to stand up random sets of machines, this is the connectivity we have, which is three hundred-gig connections to major markets for Google, six connections to Oracle, five to Amazon, five to Microsoft. These are basically there and ready to go; as was mentioned earlier, Fermilab has been able to take advantage of the Google connectivity on a couple of occasions now, most recently, I think, last October, when there was that inference training run. These are very, very cost-effective, such that we pay for these essentially out of the operating budget. This is just our cost of doing business, shared across all of DOE; it's not a big problem, because, much like we built into each of the national labs, we built ESnet6 into the major commercial facilities. So we're there, and a lot of these connections are just a jumper across the building, that kind of thing, from our network to that network. Go ahead.

[Dirk] 15:08:17
But this basically doesn't give you a cost advantage; it just gives you capabilities, right?

[Dale Carder] 15:08:23
Yep, exactly.

[Dirk] 15:08:23
But this, especially with Google, matches very well with their flat subscription model. I mean, you still have the normal cost: if you go just on demand, you just pay the normal egress costs, you just have a fast data connection there so that you can actually run your workflows. And with the subscription, if you get rid of egress, you can of course use it fully. Okay, thanks.

[Dale Carder] 15:08:46
Yeah, exactly. I think Oracle also may waive egress fees; I forget who is using that in DOE.
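
[Editor's note: a toy comparison of the two egress models discussed here, on-demand per-GB charges versus a flat subscription that waives egress. Every rate below is an invented placeholder, not actual vendor pricing; the point is only that a flat fee wins once monthly egress passes a break-even volume.]

    # Placeholder rates only; real vendor pricing is negotiated and tiered.
    PER_GB_RATE = 0.08          # assumed $/GB on-demand egress
    FLAT_MONTHLY_FEE = 20000.0  # assumed $/month flat subscription that waives egress

    def monthly_egress_cost(terabytes_moved: float, subscription: bool) -> float:
        """Cost of moving data out of the cloud for one month."""
        if subscription:
            return FLAT_MONTHLY_FEE                   # egress waived under the flat model
        return terabytes_moved * 1000 * PER_GB_RATE   # per-GB charges dominate at scale

    for tb in (50, 250, 1000):
        on_demand = monthly_egress_cost(tb, subscription=False)
        flat = monthly_egress_cost(tb, subscription=True)
        print(f"{tb:5d} TB/month: on-demand ${on_demand:,.0f} vs subscription ${flat:,.0f}")

[With these assumed rates, the break-even sits at 250 TB/month; above that, the flat model is strictly cheaper.]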

[Enrico Fermi Institute] 15:08:57
So, quickly though, to take advantage of this: if I were to log on to EC2, and I've landed in, I guess, the right availability zone, and I'm moving data from somewhere in Amazon to somewhere connected to ESnet6, then to me as the quote-unquote user, I don't have to do anything special?

[Dale Carder] 15:09:19
Right. And this whole slide probably applies both to ESnet and to Internet2; I think we're probably nearly identical in capabilities in this regard, because it's just easy to scale up as usage

[Dale Carder] 15:09:32
falls into place. One thing I'll point out, though: in these direct connections to these peers, there is human-to-human-level negotiation to get these into place. So, for example, it took months to connect to Google. They said, well, how much are you going to use, and we're like, I don't know, all of it, right? They were like, yeah, whatever. And then what do we do? We use all of their GPUs, for example, because we can. These providers are much more used to diurnal traffic flows, like you would see with commercial users during the day and residential users at night. So to get these in place does require some negotiation and some long-range planning, because we have to talk them into it and prove we're going to use it.

[Dale Carder] 15:10:21
Paolo, I see you've got your hand up.

[Paolo Calafiura (he)] 15:10:23
Yeah, it was kind of a question already asked, and then another one. I believe that there were also some peering agreements with discounts set up, right, if we use your boxes? If I recall correctly, the Amazon one is something like: if you use X amount of compute, some percentage of that can be egress. Yeah, exactly, something of that order. And then, just out of curiosity: why do you have the most boxes to Oracle? Is that just because it happened and they were easy to deal with, or because there is a use case for it?

[Dale Carder] 15:11:05
There's almost certainly someone going to use it; DOE is very, very big. Between the Office of Science and NNSA and all the other stuff going on, there's also the DOE federal network itself, which is now an overlay on ESnet. That's the nice thing: once you hit that scale, we can kind of share the economics of this.

[Dale Carder] 15:11:32
And then quickly, I'll go over the private cloud interconnects. This is where we're putting into place, actually as we speak, a terabit of connectivity to a third party called PacketFabric, and then they go through and punch physical connectivity into each of the vendors for the private cloud hosting. That will replace things like we had previously with the Cloud Exchange product. So again, that's a bit more targeted at administrative workloads, but as we get into talking about LHCONE, maybe it's a model that could be used there, too. I don't know. Dirk?

[Dirk] 15:12:17
I think Fernando was first, if he wants to go. No, now he lowered his hand.

[Fernando Harald Barreiro Megino] 15:12:31
I just had a quick question. Sorry, sorry, I was on mute, and I still didn't get over the public cloud section. So do I need to be on Google in the availability zone or region of Seattle, Chicago, or NYC in order for my transfers to go through ESnet?

[Dale Carder] 15:12:54
Every vendor is different. With Google, I think they will haul traffic regardless of where it ingresses or egresses their network. Amazon is the exact opposite, where you have to send the traffic to the exact zone. All these systems are proprietary in that regard, and you unfortunately kind of have to know in advance what you're walking into.

[Fernando Harald Barreiro Megino] 15:13:24
And a transfer to, I mean, SWT2 or anywhere at some university in the US, that will go through the normal Internet and will not end up on ESnet, right?

[Dale Carder] 15:13:41
Right. So for Google, that'd be the case. For Amazon, where ESnet does not peer with Amazon in Europe, we would probably never see the traffic until it shows up through whatever other path exists.

[Fernando Harald Barreiro Megino] 15:13:54
Okay, thanks.

[Dirk] 15:13:58
Okay, and I had a question yesterday, when we talked briefly about Lancium. I remember from talking with them that they said they had plans to peer with, I think it was ESnet. Are you aware of anything? I think they're still building the data center, so I'm not sure at what stage they are with that.

[Dale Carder] 15:14:24
Our general peering policy is relatively wide open, as long as we can justify it. So any new market entrants: it should not be a barrier on the network side, as long as they show up at essentially any major co-location facility where networks come and meet together. For example, we're in Houston, we're in Dallas, we're in El Paso; kind of their neck of the woods. So that question is very easy.

[Dale Carder] 15:14:53
All right. This is sort of the slide where I dumped all the other stuff: some other things ESnet has that are just worth having on your laundry list of things to know exist. One is APIs and dynamic requesting of resources, something ESnet has long since supported for layer-2 circuits, including bandwidth scheduling, on demand, and prioritization. That is how the LHCOPN circuits are instantiated between the tier-0 and the tier-1s.

[Dale Carder] 15:15:33
Also sort of in flight is dynamic layer-3 instantiation, which works internally to ESnet; we actually used it for LSST, between SLAC and the South American networks. It's completely conceivable to open that up also, and that could be used as a way, if you wanted, to dynamically acquire cloud resources at an API endpoint and fire it up. So these things are very much near reality, should a use case justify their development.

[Dale Carder] 15:16:07
There's an R&D project underway with Rucio integration and our framework called SENSE, which is again more on dynamic network path provisioning across the, what do you call it. And potentially sort of kicking off now is where something like the NERSC superfacility concept makes the logical next step.

[Dale Carder] 15:16:40
Internal to ESnet, we now have FPGA experience in house. We have been working on some projects where we're using FPGAs to accelerate different use cases we've seen, from triggers to compute, and dynamic load balancing in hardware; we're working on something like that for JLab, and I think there is also a similar effort underway between the ALS and NERSC. In addition, those FPGAs can be used, in my crystal ball, when we think about hitting the scaling limits of CPUs; that probably also means we'll end up hitting the scaling limits of TCP. Someone smarter than me has probably already figured out when that happens, but we're sort of ready for that era, with the ability to run ESnet code on FPGAs today.

[Dale Carder] 15:17:40
On the more operational side of the house, we've got an R&D project underway on deployment of packet marking: using annotations in the IPv6 packet header to identify what workload is running, and then reporting that back out from an accounting perspective, of what science domain and activity is on a particular link. That'll be pretty useful for capacity planning, traffic engineering, those sorts of use cases.
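
[Editor's note: a minimal sketch of what this style of packet marking can look like, packing a science-domain and activity identifier into the 20-bit IPv6 flow label. The bit widths and field layout below are illustrative assumptions, not the actual marking specification.]

    # Illustrative layout: 9 + 6 + 5 = 20 bits, the size of the IPv6 flow label.
    EXPERIMENT_BITS = 9   # assumed width for the science-domain ID
    ACTIVITY_BITS = 6     # assumed width for the activity ID
    ENTROPY_BITS = 5      # remaining bits left for per-flow entropy

    def encode_flow_label(experiment_id: int, activity_id: int, entropy: int) -> int:
        """Pack IDs into a 20-bit flow label value (hypothetical layout)."""
        assert experiment_id < (1 << EXPERIMENT_BITS)
        assert activity_id < (1 << ACTIVITY_BITS)
        assert entropy < (1 << ENTROPY_BITS)
        return ((experiment_id << (ACTIVITY_BITS + ENTROPY_BITS))
                | (activity_id << ENTROPY_BITS)
                | entropy)

    def decode_flow_label(label: int) -> tuple[int, int, int]:
        """Recover (experiment, activity, entropy), e.g. for per-link accounting."""
        entropy = label & ((1 << ENTROPY_BITS) - 1)
        activity = (label >> ENTROPY_BITS) & ((1 << ACTIVITY_BITS) - 1)
        experiment = label >> (ACTIVITY_BITS + ENTROPY_BITS)
        return experiment, activity, entropy

    label = encode_flow_label(experiment_id=3, activity_id=12, entropy=7)
    print(hex(label), decode_flow_label(label))  # 0x1987 (3, 12, 7)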

[Dale Carder] 15:18:13
And then here's my catch-all for XCache. I think there's certainly a promising future of more integrated caching, or even, bigger picture, storage on or in the network, to better use the resources available to us; latency hiding, for example, is sort of an easier use case. I think there are currently caches: there's one in California, I don't know the status, there might be one in Chicago, and also one planned for Boston. It seems to me, in my engineer's view, as a guy who doesn't make the decisions, that that's pretty straightforward and something we should continue to work on.

[Enrico Fermi Institute] 15:19:02
Sorry, I thought you were going to go on to the next slide. Could you talk a bit more about the layer-3 VPN instantiation? Who would this be spanning; the VPN from whom to whom?

[Dale Carder] 15:19:21
I guess, the way I build things, it's supported from anywhere to anywhere, so it's nebulous, because it's a generic framework. The idea being, you've got site A and site Z and site F, and they could create a private network overlay for that activity. Traditionally, that was something very hard to do; you'd have to go around and signal up circuits, or do all this work. Now it would look much more like: hey, here's a VLAN, connect your router into it, and it will just get to the other side, and it's completely private. It's the same technology that cloud providers are using on the back end for their virtual private networks, so you'd be using the same kind of thing.

[Enrico Fermi Institute] 15:20:09
So basically I could hit some API on your side, and you would say: okay, you connect to VLAN number 523, and the other one connects to, I don't know, 672, and the VLANs are tunneled together? You handle stitching together the layer-2 service, or whatever it takes to get from point to point?

[Dale Carder] 15:20:42
Yeah, or even layer-3 circuits, so you've got full resiliency within the continental US, that kind of thing. So it's pretty promising. I think we just need more exploration of what the use cases are there. We built it for ourselves, but there's nothing preventing that; it's sort of how it was designed. To set up one of these circuits, it takes longer to fill out the form in our database than to provision it.

[Enrico Fermi Institute] 15:21:03
Gotcha. And this is, you said, anywhere to anywhere; so I could potentially set this up at, say, Vanderbilt, and have the other end be a cloud provider?

[Dale Carder] 15:21:21
Yeah, that's what I'm thinking could be a popular use case, right? And maybe you even want to have a second cloud provider; that's totally doable.

[Enrico Fermi Institute] 15:21:25
Okay, yeah, I can definitely think of a few interesting things you could do with that.

[Dale Carder] 15:21:33
It's something where, again, let's plant the seed of a capability that exists, and see if there's a good use for it.

[Enrico Fermi Institute] 15:21:42
Oh, thank you. Yeah.
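
[Editor's note: a purely hypothetical sketch of the 'anywhere to anywhere' request described in this exchange. The endpoint URL, field names, site labels, and VLAN numbers are invented for illustration; the actual provisioning API is not shown in this discussion.]

    import json
    import urllib.request

    # Hypothetical request body: stitch a private L3VPN between a campus
    # site and a cloud region, each handed a local VLAN (numbers invented).
    request_body = {
        "service": "l3vpn",
        "endpoints": [
            {"site": "vanderbilt", "vlan": 523},   # local VLAN handed to site A
            {"site": "aws-us-east", "vlan": 672},  # local VLAN handed to the cloud side
        ],
        "resilient": True,  # carrier handles layer-2/3 stitching and failover
    }

    req = urllib.request.Request(
        "https://provisioning.example.net/api/v1/vpns",  # placeholder URL
        data=json.dumps(request_body).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # urllib.request.urlopen(req)  # would submit the request in a real system
    print(json.dumps(request_body, indent=2))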

[Dale Carder] 15:21:46
Okay, and then here's where we drift off from the known to the less known. Thinking about these facilities as part of a greater ecosystem, we've covered the DOE space well. Now, if you think about the NSF HPC sites in particular, it's even more disparate as to their connectivity and capabilities. Some sites, like, off the top of my head, San Diego, are extremely well connected; I wouldn't worry about them, because typically they own their infrastructure. NCSA is another one where oodles of network connectivity exist. But then there are other centers that, unfortunately, I think are in the scenario where their machine is off in some business park outside of town, and there's not necessarily good connectivity for a data-centric workflow. So if you're thinking about running more on NSF HPC facilities, you need to do facilitation with the sites you're thinking about, to answer some key questions: can you get your data in and out in a production fashion? Because there's a huge disparity between sites.

[Dale Carder] 15:23:09
Now, on the US side, we covered some of that just before I started. But of note, ESnet is talking to every single US tier-2 site, basically, in preparation for high luminosity. As such, we're getting a good view as to where the universities are with their regional networks, and in general, I think, with enough prior planning, which was our goal, the outlook continues to be good. But we need to keep that facilitation game up, and make sure, especially for universities that have one or two intermediate networks between them and ESnet or Internet2, that everything upgrades in lockstep, or we can't connect these things together. So the present looks good, and the key to making this work, from my perspective, is the data challenges: a thing we can point to and say, by this date it has to work as follows. The data challenges are going to be the forcing function that the community uses for internal justification, to show their administration, the provost or whoever: hey, we do need this stuff, and here's when we need it by. That program is vitally important.

[Dale Carder] 15:24:32
Now, on to the perhaps more questioning stuff on my part. This community has a network called LHCONE, which is sort of a whole other Internet, connecting together just the resources that exclusively work on these large-scale projects for the LHC. So in the US you've got US CMS and US ATLAS sites, the tier-1s and the tier-2 centers, connected to LHCONE, and then ESnet has transatlantic connectivity where we connect to our peer networks in the EU, again to the major tier-1 and tier-2 centers.

[Dale Carder] 15:25:13
On those networks, for better or worse, IP addresses are used as authorization tokens for what traffic can go onto the network, because that network has an acceptable use policy defining what can and can't be on it; namely, it's for the exclusive use of LHC traffic. Now, in the case where you've got a dedicated facility, or maybe dedicated DTN machines, and all they do is LHC-related traffic, it's pretty straightforward. When you start thinking about cloud resources, or even some of the bigger clusters, like on the Open Science Grid, these are multi-science compute nodes. We talked to our peers at Brookhaven, and this is already happening there: they have a cluster that can run any job, but this restriction on what traffic can go over LHCONE is a limiting factor, because now the source IP address of the node matters, and trying to adhere to the AUP is a problem.

[Dale Carder] 15:26:19
So we've figured this out, essentially, to the degree of very static resources; this works very well for the tier-1s and tier-2s, especially in the US. But I do not have a clear understanding of how you would integrate external resources into this.
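
[Editor's note: a minimal sketch of the 'IP address as authorization token' model just described: a source address is acceptable on LHCONE only if it falls inside a prefix the site has registered. The prefixes below are documentation ranges (RFC 5737 / RFC 3849), not real LHCONE announcements.]

    import ipaddress

    # Prefixes a site would have registered for LHCONE use (example ranges only)
    lhcone_prefixes = [
        ipaddress.ip_network("192.0.2.0/24"),      # e.g., a dedicated DTN subnet
        ipaddress.ip_network("2001:db8:10::/48"),  # e.g., a tier-2 storage subnet
    ]

    def allowed_on_lhcone(source_ip: str) -> bool:
        """True if the source address is covered by a registered prefix."""
        addr = ipaddress.ip_address(source_ip)
        return any(addr in net for net in lhcone_prefixes)

    print(allowed_on_lhcone("192.0.2.45"))    # dedicated node: True
    print(allowed_on_lhcone("198.51.100.7"))  # multi-science cluster node: False

[The second case is exactly the Brookhaven problem above: a multi-science node's address is not in a registered prefix, so its LHC jobs cannot use LHCONE even when the traffic itself would satisfy the AUP.]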

[Dale Carder] 15:26:50
It's an open discussion at this point; it's not like I'm here with an answer. I'm just saying, I think we can all agree that's something to be worked on, and it has big and public implications, particularly for the transatlantic traffic. That's why I had this on here: ESnet currently has five 100-gig paths across the Atlantic, and we're bringing up two additional 400-gig paths sometime next year, hopefully. These are very, very intensive builds; we're not just buying circuits, we're buying spectrum on undersea cables and integrating it into our network. And the contracting side of this is mind-bogglingly complex: these are multi-year procurements with NDAs in place. We have additional links that are going to come in after these two by 400; we're trying to get on additional cables with additional spectrum.
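
[Editor's note: back-of-envelope aggregate transatlantic capacity from the figures quoted: five existing 100G paths plus the two planned 400G paths. The later additional cables are excluded, since no numbers were given for them.]

    # Aggregate ESnet transatlantic capacity once the planned paths land
    existing = 5 * 100   # Gbit/s: five 100G paths in service
    planned = 2 * 400    # Gbit/s: two 400G paths being brought up
    total = existing + planned
    print(f"Aggregate: {total} Gbit/s ({total / 8:.1f} GB/s)")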

[Dale Carder] 15:27:39
All of this is very easy for us to integrate into LHCONE; it's very easy and straightforward for us to integrate into the DOE ecosystem. Again, how would you use that with, like, third-party cloud sites? Open for exploration; it's not clear.

[Enrico Fermi Institute] 15:27:59
So is it fair to say that it seems like all the physical capability is kind of there when it comes to talking to clouds, but doing things like getting a block of IPs and announcing those to LHCONE is challenging with the public clouds?

[Dale Carder] 15:28:16
Yup.

[Dale Carder] 15:28:19
Whereas maybe a more straightforward topology is actually something more like HEPCloud, where, from the network's perspective, it's Fermilab on either end: it's Fermilab stuff in the cloud and Fermilab stuff at home, and then it can branch out. That may be a more workable model, at least for DOE.

[Enrico Fermi Institute] 15:28:39
So you're saying, for transatlantic traffic, Fermilab is kind of the responsible party for making sure that they're agreeing with the AUP and their traffic going across the transatlantic link is LHC traffic, and the peering happens between the cloud and Fermilab?

[Dale Carder] 15:29:03
Or ESnet, yes. But yeah, the AUP is essentially such that any DOE resource can do whatever they want, including talk to universities. But at present the AUP doesn't straightforwardly allow a tier-2 to use cloud resources, brokered by ESnet as the middleman, and expect them to use all this transatlantic capability that DOE has invested in.

[Dirk] 15:29:36
Yeah, I wanted to comment on that. I think you already said that that's part of the strategy that US CMS is going with, with HEPCloud: we kind of keep it contained. We haven't done anything with large cloud use in a while, nothing like the Amazon test and the Google test five, six years ago, but even then I think we only targeted regions, the resources, in the US, so that the data traffic was contained in the US, mostly between Fermilab and these external resources. And then any kind of output is transferred over the transatlantic links somewhere else, to a European site; that, then, is an independent step that comes after, and it can go through the LHCONE network, because it originates at Fermilab at that point. And the same way for the HPC integration: the way we integrate these HPC resources is that they're connected to Fermilab; everything stays together, basically, within the US. I don't know, I mean, Fernando, if CMS had a cloud contract and they would want to do a run where they basically use all the regions in the world together, then obviously it becomes a problem, because you're talking about overlaying the global cloud resource mix on top of a somewhat partitioned network infrastructure.

[Dirk] 15:31:17
Fernando, what regions are you using right now? Okay, so it's all Europe. Okay.

[Dale Carder] 15:31:28
And just domestic to the US: the universities, like the tier-2 sites, have to a large degree separated their LHC traffic from the rest of their institution traffic. If those lines were to get blurred, that could have an impact on the universities; you can imagine scientific workloads overwhelming the cat videos and streaming lectures, right? So it's something to be quite mindful of: how the current ecosystem is built, and, if you want to morph it, the communication necessary to do so.

[Dale Carder] 15:32:15
So that's what I had. I'm happy to answer more questions, or even just...

[Enrico Fermi Institute] 15:32:22
I had a small question. You mentioned that the connectivity to NSF sites is, I guess, spotty, maybe.

[Dale Carder] 15:32:29
Yeah, notice how I didn't put that in the slide, but you can read between the lines.

[Enrico Fermi Institute] 15:32:29
So, you know there's a facility that's been built outside of Boston, some acronym, but it's a green data center type thing for all of the Boston area; both CMS and, I know, ATLAS as well have some large storage there, some large tape library that we've each bought some part of. Is this on the end of the better connected?

[Dale Carder] 15:33:13
Yeah, it benefits from being basically on-net for MIT. They're facilitating a lot of that; they're even going to be facilitating, I think in the interim, the connectivity for NET2, which is the ATLAS node there, right? I don't know if there's anyone from MIT on the call here, but I think the majority of their stuff is at Bates Lab, not at MGHPCC; but NET2 will have their new infrastructure, and their existing infrastructure, at MGHPCC. And right, they have some magic storage back end, to my understanding, that they're going to leverage for that.

[Enrico Fermi Institute] 15:33:58
I think, from one of the folks there, they have a very large IBM tape library with GPFS up front.

[Dale Carder] 15:34:20
So we've got another question; hand up, David.

[David Southwick] 15:34:24
Hi, thanks. Maybe this is a naive question, but in the current scenario of traffic, let's say, tunneling through Fermilab: if you're wanting to add whatever cloud providers, and they're all at two hundred or four hundred gigabit, don't you get a bottleneck when you do that?

[Dale Carder] 15:34:55
Right, so that sort of architecture is fine to a point.

[David Southwick] 15:35:02
Okay, thanks. I think I understand.

[Dirk] 15:35:05
Maybe to say something: what we did with the HEPCloud integration, it's not so much tunneling through Fermilab; it's that you basically keep the problem set contained to Fermilab plus cloud. And then later, completely asynchronously from the first step, there's how Fermilab integrates with the rest of the LHC infrastructure. So you kind of tie it together at the storage level. Basically, you move some data to Fermilab, and then, independently of that, once that data actually sits there, you can schedule work on that data that can run on cloud sites, and the network traffic to get that data to the cloud side runs from Fermilab. So they're independent steps. But of course, eventually, just because you removed the timing and it's not an immediate tunnel, you still have to keep these resources fed, on the cloud and also on the HPC side. So as the integrated capacity you want to feed in terms of computing goes up, you also have to work, on the other end, to basically keep the pipeline full of things to work on.

[Enrico Fermi Institute] 15:36:17
So with the connectivity that's in place today, with that model that Fermilab is using, would that be able to take advantage of all that physical connectivity? The thing I'm kind of struggling with is: ESnet has all this great physical connectivity to clouds; how do we take advantage of that in a meaningful way? And I know a lot of that kind of falls under your bucket of things that are hazy and need to be investigated more. Is it that, if we were to do this for ATLAS, we should mediate all of the data transfer through the tier-1 and, I guess, orthogonalize the problem, kind of like how Fermilab has, where you have connectivity from cloud to national lab as one bit, and then national lab to...

[Dale Carder] 15:37:11
Right. So you've got that; that's the class of solutions, the solution space, if you want to work within those confines. If I were a program officer at DOE or NSF, I would say: why do you need to do that? What are the other barriers that exist? Tackle those as well, because some of these are social, political.

[Enrico Fermi Institute] 15:37:38
All right. I mean, of course, our goal is to have something to say in the report, right? So what recommendation should we make, that people go...

      1308
      15:37:55,310 --> 15:38:05,190
      Dale Carder: right? So on that front, on one thing that basically came out of this community. Um, if you want to back way up, was the current um

      1309
      15:38:05,300 --> 15:38:23,189
      Dale Carder: grant system at Nsf. Has through the what's now the Cc star uh program that facilitates campus uh and regional upgrades basically manifested from the Yes net science Team Z model. And then, asf uh community buying in that is the

      1310
      15:38:23,310 --> 15:38:36,369
      Dale Carder: an an architectural model that they should provide, you know, financial support, for if you could extend upon that and say, You know, if you can imagine a world where you could seamlessly take advantage of resources, no matter where they lie. What would you need?

      1311
      15:38:36,610 --> 15:38:42,729
      Dale Carder: Couldn't us that program evolve, or again facilitate that kind of uh, you know,

      1312
      15:38:42,760 --> 15:38:43,990
      Dale Carder: connectivity,

      1313
      15:38:45,710 --> 15:38:50,300
      Dale Carder: you know. And on the time scale we're talking about, that's not unreasonable.

      1314
      15:38:54,850 --> 15:38:55,900
      Enrico Fermi Institute: Okay,

      1315
      15:38:57,950 --> 15:39:00,670
      Enrico Fermi Institute: Were there other questions for Dale?

      1316
      15:39:08,300 --> 15:39:13,280
      Enrico Fermi Institute: Okay? Well, thanks a lot, Dale. I think this is a really interesting discussion.

      1317
      15:39:13,310 --> 15:39:20,250
      Dale Carder: Yeah, and I'll stick around for the rest of the conference, too, in case more stuff comes up.

      1318
      15:39:20,320 --> 15:39:21,780
      Enrico Fermi Institute: yeah, that'd be great.

      1319
      15:39:22,730 --> 15:39:25,880
      Enrico Fermi Institute: All right. I will try to go back to the

      1320
      15:39:26,070 --> 15:39:28,459
      Enrico Fermi Institute: sharing the slides over here.

      1321
      15:39:28,910 --> 15:39:30,250
      Enrico Fermi Institute: Um.

      1322
      15:39:30,800 --> 15:39:36,420
      Enrico Fermi Institute: So this kind of leads into the next section. We wanted to talk a little bit about

      1323
      15:39:36,490 --> 15:39:38,910
      Enrico Fermi Institute: R&D efforts.

      1324
      15:39:41,730 --> 15:39:44,150
      Enrico Fermi Institute: Now we've covered some of this already.

      1325
      15:39:46,490 --> 15:39:50,170
      Enrico Fermi Institute: Um, Dirk, Did you want to say a couple of things about this?

      1326
      15:39:50,440 --> 15:40:03,390
      Dirk: Yeah. This comes directly from a question that's in the charge, where they basically ask us: is there anything we can do on the R&D side,

      1327
      15:40:03,670 --> 15:40:05,530
      Dirk: that is needed to

      1328
      15:40:05,900 --> 15:40:09,369
      Dirk: expand

      1329
      15:40:09,590 --> 15:40:23,570
      Dirk: the range of what we can do on commercial cloud and HPC, or increase the cost-effectiveness, which kind of goes hand in hand. And we already talked a little bit about LCF integration in the HPC focus area, that there's

      1330
      15:40:23,640 --> 15:40:27,459
      Dirk: work to be done on the Gpu workloads, which is

      1331
      15:40:27,810 --> 15:40:35,630
      Dirk: somewhat out of scope for this workshop, because we're not supposed to talk about framework and software development.

      1332
      15:40:35,680 --> 15:40:52,100
      Dirk: Um, but then there's also integration work. We talked a little bit about this on the cost side: at this point, estimating LCF long-term operations cost is a bit hard because the integration is not fully worked out.

      1333
      15:40:52,170 --> 15:41:01,009
      Dirk: Um, software delivery: during the HPC focus area everybody kind of agreed on wanting CVMFS everywhere,

      1334
      15:41:01,020 --> 15:41:12,510
      Dirk: and then there's at services which is also every Hpc. Seems to do their own thing and what they support. They all want to support it, but they kind of have different solutions in place,

      1335
      15:41:12,540 --> 15:41:15,390
      Dirk: and it's also to me at least a bit unclear

      1336
      15:41:15,420 --> 15:41:20,420
      Dirk: with the long-term operational needs there on this area.

      1337
      15:41:20,900 --> 15:41:28,610
      Dirk: And then we already talked a little bit about dynamic cloud use, which means basically you do your whole,

      1338
      15:41:28,750 --> 15:41:44,449
      Dirk: the whole processing chain inside the cloud. Fernando talked about that a little bit: to reduce egress charges, you basically copy in your input data once, and then do multiple processing runs on it, and

      1339
      15:41:44,460 --> 15:41:56,950
      Dirk: only keep the end result, basically, and forget about the intermediate output. Then you save, because you don't have to get it all out; you only have to get the smaller final output. We already talked about machine learning.

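      To put toy numbers on that copy-in-once pattern, here is a quick sketch; all values below are illustrative assumptions, not figures from the discussion (ingress is typically free, egress is not):

          # Toy egress arithmetic for the copy-in-once pattern described above.
          # All numbers are illustrative assumptions, not real experiment figures.
          input_tb, intermediate_tb, final_tb, runs = 100, 300, 5, 3

          export_everything = runs * (intermediate_tb + final_tb)  # pull every run's output out
          keep_in_cloud = final_tb                                 # only the final product leaves
          print(f"copy in {input_tb} TB once; egress: {export_everything} TB vs {keep_in_cloud} TB")
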
      1340
      15:41:58,040 --> 15:41:59,560
      Dirk: And then uh,

      1341
      15:42:01,030 --> 15:42:20,909
      Dirk: there's uh on d work on on different architectures to be able to support this uh which opens up possibilities in both Hpc. And Cloud use uh Fpga's um various Gpu types that feeds into the Gpu workloads, but it's not exclusive to just uh

      1342
      15:42:21,080 --> 15:42:32,130
      Dirk: uh Gpu workloads, because it could also be machine learning like, How how do we integrating machine learning to make use of these new architectures? And that's gonna be

      1343
      15:42:32,750 --> 15:42:35,820
      Dirk: integration on D. But also basic uh

      1344
      15:42:35,910 --> 15:42:41,970
      Dirk: basic on the on, on some of these topics. And then there was some

      1345
      15:42:42,710 --> 15:42:50,129
      Dirk: things that we're kind of playing around with unique that are unique to the cloud where they offering platforms that we

      1346
      15:42:50,460 --> 15:43:07,240
      Dirk: kinda that's hard to replicate in-house uh like there's, some a big, very big table experiments function as a service. I don't know too much about it. We just threw it on here. Maybe Lindsay or Mike could say something about that, or someone else that's more familiar with that.

      1347
      15:43:10,780 --> 15:43:17,670
      Paolo Calafiura (he): I won't say that i'm familiar with functions as a service. But I I just want to mention that this is also

      1348
      15:43:17,690 --> 15:43:30,329
      Paolo Calafiura (he): an area important for HPCs too. They are developing functions-as-a-service, probably with the same framework, the funcX framework. Yes,

      1349
      15:43:30,340 --> 15:43:48,699
      Paolo Calafiura (he): and there is apparently a solution for porting funcX to the main LCFs using something called... So this is something we are very interested in at the CCE, as a possible joint project across the

      1350
      15:43:48,710 --> 15:44:05,420
      Enrico Fermi Institute: So I guess, from personal experience: we actually quite routinely use Parsl for farming out analysis jobs, and at some point back in the day there was a proof of concept

      1351
      15:44:05,430 --> 15:44:11,389
      Enrico Fermi Institute: using a funcX endpoint and doing analysis jobs with that.

      1352
      15:44:11,420 --> 15:44:36,939
      Enrico Fermi Institute: So all of the groundwork for that has actually been laid out, and we could return to using it; we just ended up using Dask a little more prevalently. But it's also something that we left up to the user at the end of the day, and if we want to develop more infrastructure around it, we have a basis to start from.

      1353
      15:44:36,950 --> 15:44:53,969
      Enrico Fermi Institute: As far as going to production workflows or reconstruction, or something like that, I don't think that's been explored at all. But it looked really promising and interesting from the analysis view of things.

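      For context, a minimal sketch of the Parsl pattern mentioned above: farming out many independent analysis tasks as futures. The payload is a toy stand-in, and the local-threads config would be swapped for an HPC- or funcX-backed executor in practice.

          # Minimal Parsl sketch: map independent analysis chunks onto futures.
          # Toy payload; the config and chunking are illustrative assumptions.
          import parsl
          from parsl import python_app
          from parsl.configs.local_threads import config

          parsl.load(config)

          @python_app
          def toy_selection(seed, n_events):
              # Hypothetical per-chunk work: fraction of events passing a cut.
              import random
              random.seed(seed)
              return sum(1 for _ in range(n_events) if random.random() > 0.7) / n_events

          futures = [toy_selection(seed, 100_000) for seed in range(8)]  # 8 chunks
          print(sum(f.result() for f in futures) / 8)                    # blocks until all finish
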
      1354
      15:44:53,980 --> 15:45:07,179
      Enrico Fermi Institute: And I think at the time it was just a little bit immature compared to where things have gone more recently. For BigQuery and BigTable, I think this was actually,

      1355
      15:45:07,960 --> 15:45:21,790
      Enrico Fermi Institute: right, this was studied by Gordon Watts and company, and they did a couple of benchmarks of what the performance per dollar was for analysis-like queries on

      1356
      15:45:21,830 --> 15:45:26,670
      Enrico Fermi Institute: data sets backed by various engines,

      1357
      15:45:27,330 --> 15:45:44,309
      Enrico Fermi Institute: and we could go and take a look at that paper. But the gist of it was that BigQuery and BigTable are not nearly as cost-efficient as using RDataFrame, for instance, or coffea, or awkward plus uproot.

      1358
      15:45:44,320 --> 15:46:01,499
      Enrico Fermi Institute: So there are already some demonstrations that, while these offerings are there, they're not quite up to the performance that we can already provide with our home-grown tools. But maybe this also provides a way to talk with the bigger cloud services and say, hey,

      1359
      15:46:01,510 --> 15:46:06,510
      Enrico Fermi Institute: this is the kind of performance we need. Can we do any impedance matching here?

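      For reference, the home-grown columnar query style those benchmarks compare against is only a few lines with uproot plus awkward; the file, tree, and branch names below are hypothetical placeholders.

          # Sketch of a columnar analysis query with uproot + awkward.
          # File, tree, and branch names are hypothetical placeholders.
          import numpy as np
          import awkward as ak
          import uproot

          events = uproot.open("nanoaod.root:Events").arrays(["Muon_pt", "Muon_eta"])
          # Per-muon selection: pT > 30 GeV within |eta| < 2.4.
          good = (events["Muon_pt"] > 30.0) & (np.abs(events["Muon_eta"]) < 2.4)
          print("selected muons:", ak.sum(ak.num(events["Muon_pt"][good])))
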
      1360
      15:46:08,310 --> 15:46:15,509
      Dirk: Sorry, that was a bit of an information dump. No, it's fine. But the thing is, this is all

      1361
      15:46:16,260 --> 15:46:27,480
      Dirk: the question. One basic question I had about this is: while some of these areas that are being worked on can provide quite a great

      1362
      15:46:27,500 --> 15:46:34,020
      Dirk: improvement and user experience like. And at the analysis level you you just

      1363
      15:46:34,090 --> 15:46:40,670
      Dirk: yeah to to what extent are the applicable. If you look at like a global picture of

      1364
      15:46:40,730 --> 15:46:48,980
      Dirk: experiment resource use, I mean that because the individual user experience doesn't necessarily mean you. You save a lot of

      1365
      15:46:48,990 --> 15:47:02,780
      Dirk: resources overall, but you you can make life easier for your users, and you improve the physics output, and that's all great. It's just um in terms of looking at that application of

      1366
      15:47:02,950 --> 15:47:09,230
      Dirk: of of money in terms is Is this a large enough area that that we have to

      1367
      15:47:10,220 --> 15:47:23,680
      Enrico Fermi Institute: how prominently should we put it into the report? Basically, that's what I'm trying to get at.

      1368
      15:47:23,690 --> 15:47:37,939
      Enrico Fermi Institute: As you make things more scalable, so that folks can do the first exploratory bits of their analysis from their laptop and then scale that seamlessly into the cloud with funcX or whatever does that,

      1369
      15:47:38,120 --> 15:47:55,869
      Enrico Fermi Institute: um, if you can make it so that those first exploratory steps are at less scale, then of course, that means that the resource usage, as you scale up more and more, is going to be much more uniform between all the users that you one

      1370
      15:47:55,880 --> 15:48:08,630
      Enrico Fermi Institute: uh have engaging with the system, which means you can probably schedule it all a little bit better, as far as you know, which I think is, is another way of saying. You know you just make things nicer for the users. Um,

      1371
      15:48:08,640 --> 15:48:28,579
      Enrico Fermi Institute: but it one. It means that. Uh, uh, we are figuring out a schedule all that becomes easier, which means it becomes, uh, more efficient from your perspective, or from the operational perspective, I would say. And then, uh, it also changes the way in which people

      1372
      15:48:28,590 --> 15:48:51,250
      Enrico Fermi Institute: uh compete for resources at clusters, because all the analysis start looking more and more the same. Um. And they also start reaching the larger resources at a higher level of maturity than perhaps what you see even nowadays. Sometimes people just run stuff and see what happens. And it's very, very experimental, software let's say

      1373
      15:48:51,260 --> 15:48:54,349
      Enrico Fermi Institute: um. So I I

      1374
      15:48:54,520 --> 15:49:00,139
      Enrico Fermi Institute: to answer your question of like, is this big enough to care?

      1375
      15:49:00,760 --> 15:49:15,249
      Enrico Fermi Institute: I have a feeling that right now it is big enough to care about, and the fact that we're getting more data is going to keep it in the regime of being big enough to care about in the report, and to make sure that we actually treat this,

      1376
      15:49:15,260 --> 15:49:40,909
      Enrico Fermi Institute: at least in a special way, because the resource usage pattern is wildly different from production. But as we roll out things like functions-as-a-service, or figure out how to scale columnar analysis and RDataFrame effectively, it's going to make the usage of resources easier to manage, which is kind of good for us.

      1377
      15:49:40,920 --> 15:49:53,019
      Enrico Fermi Institute: But also uh it's not going to make it a bigger piece of the competition for all the computing resources, So that's what it sort of looks like in my mind kind of extrapolating from what we have right now. One hundred and fifty.

      1378
      15:49:53,070 --> 15:50:12,099
      Enrico Fermi Institute: Uh, I think The answer then is, uh, we need. We need to watch it and see what these systems that are just starting to come online actually do for resource usage in uh, even if it's not at scale and see if it does bring kind of this evening out of of competition for resources at tier two is

      1379
      15:50:12,110 --> 15:50:15,289
      Enrico Fermi Institute: um and otherwise making the analysis,

      1380
      15:50:15,620 --> 15:50:21,180
      Enrico Fermi Institute: analysis, computing usage a bit more, even as far as

      1381
      15:50:21,370 --> 15:50:25,670
      Enrico Fermi Institute: sorry, even as far as Job submission goes. And things like that.

      1382
      15:50:25,860 --> 15:50:29,870
      Enrico Fermi Institute: That's that's sort of my view. I I of course.

      1383
      15:50:30,000 --> 15:50:38,340
      Enrico Fermi Institute: Yeah, this is trying to predict the future. So other people please feel free to predict the future, too, and we can see what works

      1384
      15:50:39,280 --> 15:50:57,220
      Paolo Calafiura (he): always very informative to hear from you. I'm certainly not nearly as competent, and I know there are more competent people on the call who may want to chime in. But our interest from the CCE side

      1385
      15:51:05,270 --> 15:51:24,750
      Paolo Calafiura (he): complex Enough that the paradox. And by the way, Derek, you yesterday we heard that that the Cms. Cms uh is um sort of fighting against the provisioning, challenging the provision challenges, you know, creating workers with the with the right.

      1386
      15:51:24,760 --> 15:51:28,160
      Paolo Calafiura (he): Uh, we divide the capabilities.

      1387
      15:51:28,170 --> 15:51:50,549
      Paolo Calafiura (he): Uh, you know to some extent that I don't know which has since, because i'm in combat, that these issues have been addressed by the by, the by, the folks with developed parts of So some of those issues uh have made the Atlas think that far so it could be a good back end for some of our existing code in in this sort of

      1388
      15:51:50,560 --> 15:51:56,159
      Paolo Calafiura (he): and I I I i'm hoping that somebody has more competent jump.

      1389
      15:51:57,290 --> 15:52:13,480
      Enrico Fermi Institute: The only thing that I can tack onto that is that Anna and company, back in the day, figured out how to make a backfilling system using funcX and Parsl. So that's definitely something that works.

      1390
      15:52:13,530 --> 15:52:29,769
      Enrico Fermi Institute: Um, and you can, and that's also what the guys at Nebraska are doing with the last or with the the coffee Casa analysis facility as they're back filling into the production jobs. So for sure, this is a pattern that works, and that people can implement. But,

      1391
      15:52:29,780 --> 15:52:34,630
      Enrico Fermi Institute: uh, we also we don't. We don't know how it how it scales out uh

      1392
      15:52:34,750 --> 15:52:43,950
      Enrico Fermi Institute: you know, to more and more data and more and more users. The The usage right now, I would say, is fairly limited. And yeah, that's

      1393
      15:52:45,020 --> 15:52:50,759
      Enrico Fermi Institute: I. I think that helps at some context. But we definitely need to hear from more people on this,

      1394
      15:52:51,470 --> 15:52:59,310
      Dirk: Maybe just one comment: we're primarily interested in production here. But, on the other hand, analysis takes over

      1395
      15:52:59,610 --> 15:53:06,270
      Dirk: half our resources, or half the Tier 2s at least, so there's a significant fraction. So if analysis gets easier,

      1396
      15:53:06,690 --> 15:53:13,279
      Dirk: that means maybe there's more resources for production to use.
      Just as a quick correction: it's only a quarter, Dirk.

      1397
      15:53:13,390 --> 15:53:18,340
      Dirk: Oh, it's a quarter of it? I thought it's half the Tier 2s. Now it's a quarter?

      1398
      15:53:18,530 --> 15:53:20,280
      Dirk: that's a quarter. Now. Okay,

      1399
      15:53:20,350 --> 15:53:28,460
      Enrico Fermi Institute: yeah, as a as more production just shows up the the the fraction gets smaller and smaller.

      1400
      15:53:33,200 --> 15:53:46,199
      Enrico Fermi Institute: But Yeah, there, I mean, just thinking about it more. There's also this rather severe impedance mismatch, at least right now, with the kind of the can. The cadence of analysis jobs versus uh production cops,

      1401
      15:53:46,210 --> 15:53:55,879
      Enrico Fermi Institute: since it's much more bursty and short-lived as opposed to a production job that comes in, and you know it's going to use twenty four hours in a slot or something like that.

      1402
      15:53:56,180 --> 15:54:02,060
      Enrico Fermi Institute: So it's by it. By its very nature it's a much more adaptive

      1403
      15:54:02,510 --> 15:54:06,890
      Enrico Fermi Institute: and reactive scheduling problem.

      1404
      15:54:20,280 --> 15:54:28,630
      Enrico Fermi Institute: So one of the things that we mentioned with the cloud offerings, I mean, we had a couple of examples. There are big, very big table functions of the service.

      1405
      15:54:28,650 --> 15:54:47,950
      Enrico Fermi Institute: One of the questions I had at least, was it. Is there anything i'm missing right like on the cloud? Right? Because if you go and look at the service catalog for something like aws. It has this humongous, you know, spread of, of of things that they can services that they offer. Uh, is there anything that we're

      1406
      15:54:47,990 --> 15:54:49,940
      Enrico Fermi Institute: leaving on the table that we should

      1407
      15:54:50,600 --> 15:54:51,950
      Enrico Fermi Institute: you should look into?

      1408
      15:54:55,200 --> 15:54:59,800
      Enrico Fermi Institute: Uh, I'll say that something that's interesting.

      1409
      15:55:00,150 --> 15:55:18,890
      Enrico Fermi Institute: Maybe not. Maybe not just for uh clouds, but also for sort of on premises. Facilities is uh things like sonic that lets us sort of um disaggregate the gpus and the cpus. So if you're doing inference, you might not need a whole Gpu. But

      1410
      15:55:18,900 --> 15:55:27,490
      Enrico Fermi Institute: you know, as someone you know, either you buy very expensive, Let's say in the cloud case. Let's just stick that. So, you know you might have to buy. You might be buying a bunch of Gpu nodes

      1411
      15:55:27,500 --> 15:55:39,980
      Enrico Fermi Institute: uh which are many times more expensive. But you know, if the reconstruction path only needs a quarter of a gpu being able to independently scale up the number of gpus and cpus that you're running at a time. Um,

      1412
      15:55:39,990 --> 15:55:51,770
      Enrico Fermi Institute: it's something useful. And then, like I mentioned like for on premises stuff, too, because you can stick either two or four of these gpus into a box. But if the core count is two hundred and fifty-six on the node, then

      1413
      15:55:52,010 --> 15:55:54,990
      Enrico Fermi Institute: you you better hope that the the

      1414
      15:55:55,060 --> 15:56:01,679
      Enrico Fermi Institute: the fraction of time that you're spending a gpu and the speed up that you get, you know, and dolls, law and all that actually makes it worthwhile

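      A quick worked version of that Amdahl's-law check (the numbers are purely illustrative): if a fraction p of a job is accelerated by a factor s on the GPU, the whole-job speedup is 1 / ((1 - p) + p / s).

          # Amdahl's law: overall speedup when fraction p accelerates by factor s.
          def amdahl(p, s):
              return 1.0 / ((1.0 - p) + p / s)

          # E.g. if 25% of reconstruction runs 10x faster on the GPU, the job
          # only speeds up by about 1.29x: a modest return on a whole dedicated GPU.
          print(amdahl(0.25, 10.0))
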
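      Since SONIC-style disaggregation keeps the GPU behind a network service, the job-side code reduces to a thin inference client (SONIC uses NVIDIA Triton servers). A minimal client sketch, with a hypothetical server address and hypothetical model and tensor names:

          # Minimal Triton gRPC client sketch for inference-as-a-service.
          # Server URL, model name, and tensor names are hypothetical.
          import numpy as np
          import tritonclient.grpc as grpcclient

          client = grpcclient.InferenceServerClient(url="triton.example.org:8001")
          inp = grpcclient.InferInput("INPUT__0", [1, 10], "FP32")
          inp.set_data_from_numpy(np.random.rand(1, 10).astype(np.float32))
          result = client.infer(model_name="toy_model", inputs=[inp])
          print(result.as_numpy("OUTPUT__0"))  # the CPU job never owns a GPU
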
      1416
      15:56:19,070 --> 15:56:38,129
      Enrico Fermi Institute: Yes, and going on from that: there already is, and there will be, an ever-growing class of analysis user that is asking for GPUs too, and you have to again deal with this very different rate of scheduling resources for them.

      1417
      15:56:38,430 --> 15:56:55,730
      Enrico Fermi Institute: Um, and sometimes there the amount of, or at least the the the burstiness of the data processing that they're trying to do on that Tv is much much higher compared to like a production job. Even if the resource, the total resources, are much higher on the production side, just because of job multiplicity

      1418
      15:56:55,740 --> 15:57:19,540
      Enrico Fermi Institute: that you have users that are, you know, just poking around doing their exploratory stuff, and right now we give them a whole T four. Well, t four per hour is not cheap, not cheap at all. So and you'll have people like training models and then loading it onto a T for running their running, running their whole signal data set, or something like that, to see what it looks like in the tails, et cetera, et cetera, or running it on their backgrounds.

      1419
      15:57:19,580 --> 15:57:24,290
      Enrico Fermi Institute: And it's still the same problem of needing to

      1420
      15:57:24,450 --> 15:57:42,980
      Enrico Fermi Institute: uh, very piecemeal uh schedule your gpus, and then on top of that schedule, all the networking between them, because you have this really insane burst of uh inference requests for a very short amount of time that you need to negotiate on your network to not net or not mess with everyone else's jobs.

      1421
      15:57:43,170 --> 15:57:44,580
      Enrico Fermi Institute: So

      1422
      15:57:44,620 --> 15:57:54,399
      Enrico Fermi Institute: it might not be. It might not be a huge what you said It's a quarter of the tier two right now. It's. Let's say it just stays a quarter of that. But the

      1423
      15:57:54,590 --> 15:58:09,069
      Enrico Fermi Institute: the the way that it's going to be using the resources if it's that bursty may not look like a quarter at certain points in time during the analysis workflow, and that's something we have to be ready to deal with.

      1424
      15:58:09,370 --> 15:58:13,230
      Enrico Fermi Institute: I have no idea how to actually schedule that.

      1425
      15:58:13,490 --> 15:58:14,539
      Mhm

      1426
      15:58:19,200 --> 15:58:23,320
      Enrico Fermi Institute: So so we're almost at the top of the hour.

      1427
      15:58:23,800 --> 15:58:28,420
      Enrico Fermi Institute: So any other topics that we wanted to hit before we wrap up for the day.

      1428
      15:58:41,590 --> 15:58:47,809
      Enrico Fermi Institute: So I think logistically, we were going to tomorrow. Talk a little bit about.

      1429
      15:58:49,090 --> 15:58:54,949
      Enrico Fermi Institute: See? In the morning I think we were going to talk about accounting and pledging.

      1430
      15:58:55,240 --> 15:58:57,530
      Enrico Fermi Institute: We're going to talk about some, you know.

      1431
      15:58:57,840 --> 15:59:14,780
      Enrico Fermi Institute: Facility, features, policies. How did a discussion about security topics when it comes to Hpc. And Cloud. Um: Yeah. Allocations, you know, planning that sort of thing, I think, in the afternoon,

      1432
      15:59:14,790 --> 15:59:18,350
      Enrico Fermi Institute: and have a a presentation from the

      1433
      15:59:18,520 --> 15:59:22,869
      Enrico Fermi Institute: from the Vera Rubin folks to talk about their experiences.

      1434
      15:59:23,700 --> 15:59:42,449
      Enrico Fermi Institute: And then, yeah, some summary type of work and and just you know other other topics or observations that people would like to bring up. So I mean, if there's something that that hasn't that we haven't hit on the agenda that people would really like to talk about um tomorrow afternoon. It'd be a really good time to to bring that

      1435
      15:59:47,150 --> 15:59:49,349
      Enrico Fermi Institute: anything else from anyone.

      1436
      15:59:55,150 --> 16:00:00,209
      Enrico Fermi Institute: Okay, sounds like, Not all right, Thanks, everybody. We'll talk to you tomorrow.

      1437
      16:00:01,790 --> 16:00:03,559
      Fernando Harald Barreiro Megino: Bye. Thank you.

      • 13:00
        R&D 20m

        LCF integration - GPUs
        Machine Learning
        Unique cloud offerings
        Architectures - FPGA, ARM
        BigQuery/BigTable
        Functions-as-a-service

        [Eastern Time]

         

        CERN presentation

        615
        14:00:29,710 --> 14:00:34,679
        Enrico Fermi Institute: I think this is the last session that's focused exclusively on cloud.

        616
        14:00:36,900 --> 14:00:37,920
        Yeah.

        617
        14:00:38,670 --> 14:00:44,219
        Enrico Fermi Institute: In the next session we'll talk about some R&D things and networking. So,

        618
        14:00:52,660 --> 14:00:57,720
        Enrico Fermi Institute: okay, so maybe we break here, and we'll see everybody at one o'clock Central Time,

        619
        14:00:58,540 --> 14:01:00,130
        Fernando Harald Barreiro Megino: so you know

        620
        14:01:01,310 --> 14:01:02,620
        Enrico Fermi Institute: machine learning,

        621
        14:01:03,820 --> 14:01:09,699
        Enrico Fermi Institute: and then we'll go back to the topics as presented in the slides,

        622
        14:01:10,610 --> 14:01:12,850
        Enrico Fermi Institute: so we'll just get started in a few minutes here,

        623
        14:01:53,380 --> 14:01:56,370
        Maria Girone: So it's Eric starting first, right?

        624
        14:01:56,540 --> 14:02:04,520
        Maria Girone: Yeah, if Eric is ready to present, we thought maybe it would be best to just have him,

        625
        14:02:07,990 --> 14:02:16,680
        Enrico Fermi Institute: since it's getting a little bit late over there. Yeah, exactly, we want to be considerate of people's time in your zone especially. Thank you.

        626
        14:02:29,590 --> 14:02:39,579
        Enrico Fermi Institute: So just give it like two more minutes, and then, Eric, whenever you're ready, put your slides up. I'll stop sharing here, and we'll get started shortly.

        627
        14:02:42,740 --> 14:02:47,450
        Eric Wulff: Sounds good. I'm ready whenever, so just let me know. Okay.

        630
        14:03:09,350 --> 14:03:17,650
        Enrico Fermi Institute: It seems like the rate at which people are rejoining has slowed down significantly. So I think you can go ahead and start.

        631
        14:03:22,080 --> 14:03:23,529
        Eric Wulff: uh, Okay.

        632
        14:03:24,610 --> 14:03:25,870
        Eric Wulff: So

        633
        14:03:27,290 --> 14:03:31,050
        Eric Wulff: i'm sharing. Now, I think. Can you see?

        634
        14:03:31,340 --> 14:03:33,999
        Eric Wulff: Yes, it looks good. Okay, great.

        635
        14:03:34,560 --> 14:03:37,929
        Eric Wulff: Um. So I I just have a

        636
        14:03:38,180 --> 14:03:52,689
        Eric Wulff: two or three slides here. So it's a very short presentation just to talk a little bit about what we have been doing uh regarding distributed training and hypertuning uh of deep learning based algorithms using you have from us computing.

        637
        14:03:53,360 --> 14:04:00,499
        Eric Wulff: So this is something that I have been doing in context of the A Eu Funded Research project called Say, We race

        638
        14:04:06,260 --> 14:04:08,620
        Eric Wulff: involved in this, and she's my supervisor.

        639
        14:04:09,580 --> 14:04:10,969
        Um.

        640
        14:04:12,850 --> 14:04:15,450
        So let's see if I can change slide.

        641
        14:04:15,770 --> 14:04:17,940
        Eric Wulff: Yes, um.

        642
        14:04:18,590 --> 14:04:24,429
        Eric Wulff: So just for for if you're not aware uh hyper parameter organization. Um.

        643
        14:04:25,320 --> 14:04:35,079
        Eric Wulff: So if you're not aware of what that is, I've tried to use it very quickly here in just one slide. So it's. I will sometimes refer to it as as a hyper tuning,

        644
        14:04:35,140 --> 14:04:36,670
        Eric Wulff: and um,

        645
        14:04:36,730 --> 14:04:39,300
        Eric Wulff: it's basically to um

        646
        14:04:39,340 --> 14:04:49,350
        Eric Wulff: to tune the uh hyper parameters all the an Ai model or a deep learning model, and hyper parameters are simply the model sets. Um

        647
        14:04:58,840 --> 14:05:09,139
        Eric Wulff: um, and they can define things like the model architecture. So, for instance, how many layers you have in your neural network? Um, How many notes you have in each layer, and so on.

        648
        14:05:09,520 --> 14:05:19,239
        Eric Wulff: Um, but they also define things. Um, that has to do with the optimization of the model, such as the learning rates, the back size and so forth.

        649
        14:05:19,720 --> 14:05:20,570
        Yeah.

        650
        14:05:22,180 --> 14:05:28,950
        Eric Wulff: Now, if you have a a a large model, or a very top complex model, which it requires a lot of compute to

        651
        14:05:29,220 --> 14:05:30,469
        Eric Wulff: and

        652
        14:05:31,480 --> 14:05:33,510
        Eric Wulff: uh, to do the forward pass,

        653
        14:05:33,610 --> 14:05:34,950
        Eric Wulff: and

        654
        14:05:35,630 --> 14:05:38,329
        Eric Wulff: and or you have a large data sets.

        655
        14:05:38,360 --> 14:05:41,660
        Eric Wulff: Um. Hypertine can be extremely

        656
        14:05:41,940 --> 14:05:56,630
        Eric Wulff: compute resource intensive. So, therefore it can benefit greatly from Hbc. Resources. And uh, Furthermore, we need a of smart and efficient solid search algorithms to find good hyper parameters, so that we we don't waste the Hpc resources that we have

        657
        14:05:59,290 --> 14:06:00,480
        Eric Wulff: um.

        658
        14:06:01,000 --> 14:06:10,500
        Eric Wulff: So in race uh, I have been working with uh a group working on machine and particle flow uh, which is a

        659
        14:06:10,810 --> 14:06:13,939
        Eric Wulff: uh in collaboration with Cms

        660
        14:06:14,080 --> 14:06:17,230
        Eric Wulff: with people from Cms. Um, And

        661
        14:06:17,420 --> 14:06:19,599
        Eric Wulff: in order to high opportunity, this model

        662
        14:06:19,690 --> 14:06:25,310
        Eric Wulff: um in race we have been using uh an open source framework called rate you

        663
        14:06:25,750 --> 14:06:34,059
        Eric Wulff: uh, which allows us to run many different trials in parallel, using uh multiple gpus per trial

        664
        14:06:34,270 --> 14:06:39,010
        Eric Wulff: uh, which is uh what this picture up here is trying to represent.

        665
        14:06:39,570 --> 14:06:40,990
        Eric Wulff: And

        666
        14:06:42,990 --> 14:06:51,389
        Eric Wulff: now, with Rachel we can also get the very nice overview of the different trials, and we can. We can pick the one that we see, performs the best

        667
        14:06:51,580 --> 14:06:57,289
        Eric Wulff: uh and right, and also has a lot of different search algorithms that uh

        668
        14:06:57,660 --> 14:07:01,359
        Eric Wulff: help us to in the the right uh

        669
        14:07:01,690 --> 14:07:02,970
        Eric Wulff: I, the parameters.

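        A minimal sketch of that parallel-trials pattern, using Ray Tune's classic tune.run API with an ASHA-style early-stopping scheduler; the objective and search space below are toy stand-ins, not the actual MLPF setup.

            # Ray Tune sketch: parallel trials with early stopping (ASHA).
            # Toy objective and search space; not the actual MLPF configuration.
            from ray import tune
            from ray.tune.schedulers import ASHAScheduler

            def trainable(config):
                loss = 10.0
                for epoch in range(100):
                    loss *= 1.0 - config["lr"]   # toy stand-in for a training epoch
                    tune.report(loss=loss)       # lets the scheduler stop bad trials

            analysis = tune.run(
                trainable,
                config={"lr": tune.loguniform(1e-4, 1e-1),
                        "batch_size": tune.choice([32, 64, 128])},
                num_samples=20,                      # number of trials
                scheduler=ASHAScheduler(metric="loss", mode="min"),
                resources_per_trial={"cpu": 1},      # e.g. {"gpu": 1} on a GPU node
            )
            print(analysis.get_best_config(metric="loss", mode="min"))
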
        670
        14:07:03,430 --> 14:07:18,949
        Eric Wulff: And here, to the right, we have an example of of the kind of a difference this can make to to the learning of the model. So Here we have plotted the um training and validation losses for, and after hyper tuning,

        671
        14:07:20,620 --> 14:07:32,120
        Eric Wulff: so as you can see here, the the loss went down quite a bit after hypertuning almost by a factor of two, and the furthermore, the the training seems to be much more stable. We have a

        672
        14:07:32,380 --> 14:07:36,559
        Eric Wulff: these bands which will present the the standard deviation of

        673
        14:07:36,750 --> 14:07:42,170
        Eric Wulff: between different trainings. It's it's much more stable in the right plot.

        674
        14:07:47,030 --> 14:07:56,090
        Eric Wulff: Um and I just had one more slide here to sort of illustrate how you still uh high performance computing can be in order to speed up

        675
        14:07:56,810 --> 14:07:58,380
        hyperparameter optimization.

        676
        14:07:58,560 --> 14:08:03,430
        Eric Wulff: Uh. So this just shows the scaling uh from four to twenty-four

        677
        14:08:03,680 --> 14:08:05,309
        Eric Wulff: computing notes.

        678
        14:08:05,330 --> 14:08:06,550
        Eric Wulff: Um,

        679
        14:08:06,990 --> 14:08:15,439
        Eric Wulff: maybe particularly looking at the plot to the right here we can see that the scaling for this use case is actually better than linear

        680
        14:08:15,570 --> 14:08:20,269
        Eric Wulff: um, which at least in part has to do with, uh

        681
        14:08:20,820 --> 14:08:26,109
        Eric Wulff: some excessive reloading of models that happens when when we have the few notes.

        682
        14:08:28,060 --> 14:08:29,150
        Eric Wulff: Um.

        683
        14:08:31,070 --> 14:08:35,830
        Eric Wulff: So. Um: Well, this basically means that the more the more

        684
        14:08:36,030 --> 14:08:41,099
        Eric Wulff: notes we have, the more people we have with the faster we can tune and apply these bottles.

        685
        14:08:41,670 --> 14:08:47,480
        Eric Wulff: That's all I had for for this.

        686
        14:08:48,740 --> 14:08:58,029
        Enrico Fermi Institute: Can you tell a priori from the model that that you'll that the model you're using will

        687
        14:08:58,080 --> 14:09:04,340
        Enrico Fermi Institute: force up behavior, so that if someone comes with any given model, you know how to sort of shape the work,

        688
        14:09:06,550 --> 14:09:15,609
        Enrico Fermi Institute: if you understand what I mean and no, no. What I mean is, you discovered that you get better than linear scaling with this training?

        689
        14:09:15,700 --> 14:09:16,719
        Right?

        690
        14:09:17,160 --> 14:09:22,499
        Enrico Fermi Institute: That's not always the case with, Or is that the case with any given model.

        691
        14:09:23,150 --> 14:09:24,459
        Um,

        692
        14:09:25,150 --> 14:09:33,199
        Eric Wulff: yeah, I think it would be so. This is sort of uh. This is showing the scaling of the hyper parameter organization itself.

        693
        14:09:33,650 --> 14:09:40,180
        Eric Wulff: Um, so it's not. If if you had just a single training, it wouldn't scale like this it would be

        694
        14:09:40,360 --> 14:09:42,610
        Eric Wulff: a a bit worse than linear probably.

        695
        14:09:45,610 --> 14:09:51,289
        Eric Wulff: But So the way that the hypertuning works in this case is that we

        696
        14:09:51,430 --> 14:09:53,199
        Eric Wulff: we launched a bunch of

        697
        14:09:53,690 --> 14:09:56,980
        Eric Wulff: trials in parallel with different type of parameter

        698
        14:09:57,010 --> 14:09:58,559
        Eric Wulff: configurations.

        699
        14:09:58,990 --> 14:10:00,189
        Eric Wulff: And then

        700
        14:10:00,340 --> 14:10:01,780
        Eric Wulff: um!

        701
        14:10:02,230 --> 14:10:10,820
        Eric Wulff: There is a sort of a scheduling or search algorithm, looking at how well all these trials perform,

        702
        14:10:10,940 --> 14:10:22,829
        Eric Wulff: and then it's a terminates once that look less promising and continuous training, the ones that look promising. And then we can also have some kind of base and optimization

        703
        14:10:23,190 --> 14:10:26,360
        Eric Wulff: component here, which tries to predict which

        704
        14:10:27,470 --> 14:10:31,230
        Eric Wulff: hyper parameters would perform. Well, and then we try those next,

        705
        14:10:32,930 --> 14:10:39,059
        Enrico Fermi Institute: and if you were to double or triple the number of nodes you would continue to does the

        706
        14:10:39,310 --> 14:10:42,929
        Enrico Fermi Institute: does the actual growth begin to flat now?

        707
        14:10:43,430 --> 14:11:00,910
        Eric Wulff: Um! I I haven't tested this um more than up to twenty four notes uh, so I can't say for sure, but I I imagine it will continue for at least a bit more. But um I I can't say for how long, and

        708
        14:11:01,060 --> 14:11:16,039
        Enrico Fermi Institute: I I also see that eventually it would flack off.

        709
        14:11:17,080 --> 14:11:18,540
        Eric Wulff: Um,

        710
        14:11:19,510 --> 14:11:23,909
        Enrico Fermi Institute: because that's all it. Yeah, it it's nothing. The issue is a resource contention.

        711
        14:11:24,600 --> 14:11:30,520
        Eric Wulff: Yeah, it's a that has to do with the with this search. Algorithm That's

        712
        14:11:30,630 --> 14:11:32,309
        Eric Wulff: um

        713
        14:11:33,180 --> 14:11:39,990
        Eric Wulff: trains a few trials and then terminates bad once, and then continues with new ones. So

        714
        14:11:40,360 --> 14:11:48,789
        Eric Wulff: if you have more more trials than you have notes that that you want to run uh. You have to sort of the

        715
        14:11:49,280 --> 14:11:54,179
        Eric Wulff: post trials at some point, and can and start training other ones.

        716
        14:11:54,590 --> 14:11:56,110
        Eric Wulff: Um!

        717
        14:11:56,270 --> 14:12:02,699
        Eric Wulff: Because you need to trade all the trials up to the same epoch number before you decide which ones to keep, and not

        718
        14:12:04,140 --> 14:12:11,450
        Eric Wulff: so it. It. It doesn't have to do with ray tune per se. It just has to do with the the particular search algorithm or

        719
        14:12:11,530 --> 14:12:15,219
        Eric Wulff: a lot of search algorithms actually work work like that.

        720
        14:12:18,070 --> 14:12:19,019
        Yeah,

        721
        14:12:19,250 --> 14:12:21,929
        Enrico Fermi Institute: you have a question or comment for me in the chat.

        722
        14:12:22,100 --> 14:12:40,870
        Ian Fisk: Yeah, I had a question for Eric, and maybe it's too early to tell. But my question was: how stable do you expect the hyperparameter tuning to be, in the sense that, are we expecting that every time we change the network or get new data we're going to have to re-optimize the hyperparameters? Or is this something

        723
        14:12:40,880 --> 14:12:50,119
        Ian Fisk: um that once we sort of ha I optimize for a particular problem that we may find that those are stable over periods of time. The reason, I ask is that This seems like A.

        724
        14:12:50,620 --> 14:12:59,900
        Ian Fisk: When we talk about the use of Hpc. Or clouds and specialized resources, like training is A is a big part of how we tend to use them. But the hyper parameter

        725
        14:13:00,190 --> 14:13:11,330
        Ian Fisk: optimization sort of increases that by a factor of fifty or so. And so, if we have to do it each time. We probably need to factor those things in in our thoughts about how we're where we're constrained resources.

        726
        14:13:12,110 --> 14:13:14,099
        Eric Wulff: Yeah, so

        727
        14:13:14,770 --> 14:13:16,039
        Eric Wulff: um,

        728
        14:13:16,760 --> 14:13:23,389
        Eric Wulff: it. It would completely depend on how much you change your model, or how much you change the problem.

        729
        14:13:23,470 --> 14:13:24,989
        Eric Wulff: I mean, if you're

        730
        14:13:25,010 --> 14:13:27,139
        Eric Wulff: if you change your model

        731
        14:13:27,180 --> 14:13:32,739
        Eric Wulff: architecture, I it, you will probably have to run a new hyper primary organization.

        732
        14:13:32,770 --> 14:13:38,310
        Eric Wulff: Um, because you might do not even have the same hyper parameters in your model anymore.

        733
        14:13:38,550 --> 14:13:40,150
        Eric Wulff: Uh,

        734
        14:13:40,610 --> 14:13:56,560
        Eric Wulff: and but But you know, if if things aren't two different, you might not have to to hypertune, you might. You might, or just, maybe, as to a small hyper tuning, you know, just a few parameters in in some narrow or small search space.

        735
        14:13:56,690 --> 14:13:58,640
        Eric Wulff: So, for instance,

        736
        14:13:59,020 --> 14:14:00,809
        Eric Wulff: if you look at other

        737
        14:14:00,840 --> 14:14:01,950
        Eric Wulff: uh

        738
        14:14:02,920 --> 14:14:06,280
        Eric Wulff: ah, other fields, such as, for instance, a

        739
        14:14:06,390 --> 14:14:09,070
        Eric Wulff: image recognition, or all the detection.

        740
        14:14:09,210 --> 14:14:26,879
        Eric Wulff: Um, if you find a network that performs well on, you know, classifying certain kinds of objects uh, then it's very likely that they, you know, using the same, have a parameters. It would be good at classifying other kinds of objects as well. If you just have labour data for for those objects.

        741
        14:14:26,890 --> 14:14:29,329
        So in that case, probably you wouldn't have to

        742
        14:14:31,100 --> 14:14:33,510
        Eric Wulff: run a full hyper-prampt organization again.

        743
        14:14:37,260 --> 14:14:46,599
        Ian Fisk: Thanks. It's very impressive, the amount that it improves the situation. Getting a factor of two is nice.

        744
        14:14:48,460 --> 14:14:49,360
        Eric Wulff: Thanks.

        745
        14:14:50,810 --> 14:15:07,050
        Paolo Calafiura (he): A question or comment from Paolo. Yes, I hope the question wasn't addressed already; I missed the first couple of minutes. So my question is: here you're starting to show the scaling at four nodes,

        746
        14:15:07,060 --> 14:15:13,339
        Paolo Calafiura (he): and I wonder what would the scaling look like if you compare it with a single null or in a single gpu.

        747
        14:15:14,870 --> 14:15:16,540
        Eric Wulff: Um.

        748
        14:15:26,890 --> 14:15:32,669
        Eric Wulff: The few notes you have the more all this excessive reloading has to happen.

        749
        14:15:32,930 --> 14:15:37,320
        Eric Wulff: So you're just just using one now would be very, very slow.

        750
        14:15:37,510 --> 14:15:50,440
        Paolo Calafiura (he): But that's because of the way does it does this business. It's because of the search algorithm we use. So it's not the way to per. Say It's the

        751
        14:15:51,360 --> 14:15:58,859
        Eric Wulff: it's because of the algorithm you. You wouldn't be able to run this faster with another framework. Well, I mean

        752
        14:15:59,760 --> 14:16:18,139
        Paolo Calafiura (he): it. It. It. It's the algorithms problem, not way, too. So it's It's a little bit harder than to to do the the comparison. I mean, i'm thinking, if you use psychic labels like it, optimize on single Gpu to do to do the same thing. And then, of course, there is the question, What is the

        753
        14:16:22,910 --> 14:16:26,699
        Paolo Calafiura (he): Okay, it's It's a complicated question.

        754
        14:16:29,870 --> 14:16:32,029
        Okay? Next we have

        755
        14:16:34,400 --> 14:16:45,700
        Shigeki: Uh yeah, I'm gonna show my ignorance here. Um, just trying to understand exactly how this works. Uh: I think i'm on the first slide. Second slide.

        756
        14:16:45,730 --> 14:16:54,140
        Shigeki: You show the trial one trial to trial trial, and those trials are independent of each other. Right? They're all working on.

        757
        14:16:54,440 --> 14:17:12,849
        Shigeki: Okay, uh. The next thing here is that presumably they're they're reading the same set of data over in a uh in order to train uh, they don't. They're completely independent in terms of of the of the where they are in the input. Stream. Right? They're. They're not like working in lockstep or anything.

        758
        14:17:13,630 --> 14:17:25,690
        Eric Wulff: This is prior one. So it it. It depends on the kind of the search algorithm that you use the hyper perimeter search algorithm So um

        759
        14:17:26,590 --> 14:17:27,650
        Eric Wulff: in um.

        760
        14:17:28,350 --> 14:17:40,270
        Eric Wulff: Well, to to be with you. You you can choose not to use any particular search algorithm and then everything is just done uh in parallel sort of um,

        761
        14:17:40,560 --> 14:17:41,710
        Eric Wulff: however,

        762
        14:17:42,000 --> 14:17:53,250
        Eric Wulff: and it's it's It's much more efficient to use some kind of search. Algorithm So then um! You would want to train all the trials up to a certain

        763
        14:17:53,570 --> 14:17:58,200
        Eric Wulff: epoch number. Let's say you train them all up to you. Put five, and then you look at

        764
        14:17:58,530 --> 14:18:08,800
        Eric Wulff: uh, they have some algorithm that decides which wants to terminate, and which ones to continue training, and in place of the ones you terminated, you start new trials

        765
        14:18:08,820 --> 14:18:12,450
        Eric Wulff: with the with new hyper parameter configurations.

        766
        14:18:12,500 --> 14:18:19,529
        Eric Wulff: Um. So then, that if you have many more trials, then you have confused notes. You have to

        767
        14:18:19,720 --> 14:18:27,839
        Eric Wulff: pause some of some trials at a point five, and then load in new trials and train them out until they book five.

        768
        14:18:28,230 --> 14:18:30,749
        Shigeki: Okay. So

        769
        14:18:31,070 --> 14:18:35,280
        Shigeki: okay. But to a certain extent, though the the the trials are running independent,

        770
        14:18:35,290 --> 14:18:51,889
        Shigeki: and they get synchronized at some point by by the Atlantic. That you that you're that you're stopping at. But other than that within up to that epoch point uh they're running. They're they're what they're blasting through the the the data as quickly as they they they can. And And so they? They're not in sync. Okay,

        771
        14:18:52,640 --> 14:18:53,690
        Shigeki: thank you.

        772
        14:18:56,430 --> 14:18:59,330
        Enrico Fermi Institute: So how long does it take to run this on,

        773
        14:18:59,370 --> 14:19:07,800
        Enrico Fermi Institute: you know, for for one node? You know. How long is it running the the hyper parameter optimization in terms of all, all all time? Hours?

        774
        14:19:08,120 --> 14:19:09,599
        Eric Wulff: Um!

        775
        14:19:10,010 --> 14:19:11,059
        Eric Wulff: So

        776
        14:19:11,130 --> 14:19:21,010
        Eric Wulff: that that can vary a lot, depending on how large your search basis and the can, what we use and the data that we use, and so on, I think for the for the results I show here.

        777
        14:19:21,310 --> 14:19:22,860
        Eric Wulff: Um

        778
        14:19:23,820 --> 14:19:26,859
        Eric Wulff: uh, If I remember correctly,

        779
        14:19:27,120 --> 14:19:33,029
        Eric Wulff: the whole thing took uh around eighty hours in

        780
        14:19:33,190 --> 14:19:35,740
        Eric Wulff: in wall time,

        781
        14:19:35,980 --> 14:19:40,909
        Eric Wulff: and that's using uh that was using uh twelve

        782
        14:19:40,930 --> 14:19:45,800
        Eric Wulff: confused notes with four to us each.

        783
        14:19:45,810 --> 14:20:11,110
        Enrico Fermi Institute: That can be, you know, trivially broken up into into multiple drops and things like that. The reason I ask is one of the things I notice is that on you know some of the Hpcs uh, at least in the Us. Right. They they have, you know, maximum wall time, for you know you jobs in the queues right? So like I'm i'm looking at, you know pearl matter right now, and it says you can have a a gpu job in the regular queue uh for twelve hours at most.

        784
        14:20:11,120 --> 14:20:15,659
        Enrico Fermi Institute: And so i'm wondering like, what what useful work can we get done, or

        785
        14:20:15,870 --> 14:20:25,280
        Enrico Fermi Institute: you know, hyperparameter, optimization or machine learning in general, you know, given the relatively short maximum of all time.

        786
        14:20:25,450 --> 14:20:29,280
        Eric Wulff: Um. So one solution is to uh

        787
        14:20:29,460 --> 14:20:31,290
        Eric Wulff: tick points. The

        788
        14:20:31,950 --> 14:20:39,149
        Eric Wulff: the the search, and then just launch it again and continue where you left off. So the we're able to do that. So

        789
        14:20:39,190 --> 14:20:44,300
        Eric Wulff: we are saving checkpoints regularly through the the workload.

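        A minimal sketch of that checkpoint-and-resume pattern, in PyTorch-style code, for fitting a long training into short wall-time queues; the paths and the training-loop body are placeholder assumptions.

            # Checkpoint/resume sketch for wall-time-limited queues.
            # Paths and the training loop body are placeholder assumptions.
            import os
            import torch

            CKPT = "checkpoints/latest.pt"

            def train(model, opt, data_loader, n_epochs):
                start_epoch = 0
                if os.path.exists(CKPT):                      # resuming a resubmitted job
                    state = torch.load(CKPT)
                    model.load_state_dict(state["model"])
                    opt.load_state_dict(state["opt"])
                    start_epoch = state["epoch"] + 1
                for epoch in range(start_epoch, n_epochs):
                    for batch in data_loader:
                        ...                                   # forward/backward/step as usual
                    torch.save({"epoch": epoch,               # once per epoch, as above
                                "model": model.state_dict(),
                                "opt": opt.state_dict()}, CKPT)
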
        790
        14:20:45,570 --> 14:20:47,679
        Eric Wulff: Okay? And uh, yeah,

        791
        14:20:47,820 --> 14:20:50,360
        Enrico Fermi Institute: how often do you save the checkpoints?

        792
        14:20:51,280 --> 14:21:07,169
        Eric Wulff: Um, That's configurable, But usually once per epoch. So once once per read through data sets.

        793
        14:21:08,020 --> 14:21:15,920
        Eric Wulff: Uh that. That depends a lot also. But um, let's say you around well between twelve and twenty four hours.

        794
        14:21:17,110 --> 14:21:20,540
        Eric Wulff: But this completely depends on how much data you have. And uh,

        795
        14:21:21,140 --> 14:21:24,060
        Eric Wulff: you know the the particular model they use.

        796
        14:21:24,530 --> 14:21:41,880
        Enrico Fermi Institute: That's an epoch for the hyper parameter optimization itself, not just the the neural net a single instance of the neural network

        797
        14:21:42,740 --> 14:21:45,710
        twenty-four hours for a single,

        798
        14:21:46,740 --> 14:21:53,449
        Eric Wulff: and that's um. So that you know we have quite a big data set. So that's

        799
        14:21:53,510 --> 14:22:00,430
        Eric Wulff: why. But we're also using four G four, and the J. One hundred gpus for that. So

        800
        14:22:00,820 --> 14:22:02,320
        Eric Wulff: if you have a

        801
        14:22:02,640 --> 14:22:05,420
        Eric Wulff: all the gpus that would take much longer,

        802
        14:22:08,980 --> 14:22:19,460
        Enrico Fermi Institute: I I guess What I'm wondering is, you know, for for the report, should we, you know, have some recommendation that the the policies at these sites. You know how

        803
        14:22:20,140 --> 14:22:25,540
        Enrico Fermi Institute: you know much longer Gpu jobs to run to do these sorts of tasks.

        804
        14:22:26,090 --> 14:22:29,069
        Eric Wulff: Well, my opinion is that it would be

        805
        14:22:29,720 --> 14:22:47,669
        Enrico Fermi Institute: it would be convenient to see if it if we could. But you know it's not deal breaking, because we can't checkpoint this, and just to relo right. But can you for it? You just said your your epochs are twelve to twenty-four hours, and Lincoln just said that

        806
        14:22:47,720 --> 14:22:57,990
        Eric Wulff: twelve hours. So the sorry sorry sorry. So I uh, yeah, yeah, I I I I spoke here so

        807
        14:22:58,500 --> 14:23:13,459
        Eric Wulff: uh apologies. It's a bit late over here. So it it takes it takes twenty-four hours for a full training. Not for one.

        808
        14:23:13,470 --> 14:23:33,439
        Enrico Fermi Institute: We're not asking for a policy change, right? Just a behavioral change with checkpointing. And you're saving at the end of each full training or each actual. So it's as much. Uh: yeah, yeah, Sorry for it. You have, like two hundred epochs. Is that right? Yeah, you're probably having the plot.

        809
        14:23:33,650 --> 14:23:37,789
        Eric Wulff: Uh: yeah, yeah, in the plot here. So Um:

        810
        14:23:38,030 --> 14:23:56,069
        Eric Wulff: yeah. And so the this is plot from last year. Now we have a large data set, and we train for about a hundred epochs, and that takes uh roughly, twenty four hours.

        811
        14:23:57,900 --> 14:23:59,820
        Enrico Fermi Institute: Okay, Um,

        812
        14:24:00,170 --> 14:24:13,310
        Enrico Fermi Institute: yeah, with adding more gpus per node help you in terms of a number of epochs? Or do you have enough data to get reasonable convergence with, or at least with this model after one hundred? You

        813
        14:24:21,110 --> 14:24:22,430
        Eric Wulff: actually we are.

        814
        14:24:22,690 --> 14:24:27,659
        Eric Wulff: We just saw that if we scale up our model

        815
        14:24:27,690 --> 14:24:40,729
        Eric Wulff: significantly, so make make the model larger. With many more parameters we can easily improve the physics performance. Um. So we just try that the

        816
        14:24:41,300 --> 14:24:44,330
        Eric Wulff: this week,

        817
        14:24:44,660 --> 14:24:47,859
        Eric Wulff: because we were curious. Basically Uh, however,

        818
        14:24:47,920 --> 14:24:49,790
        Eric Wulff: that's sort of not a

        819
        14:24:58,390 --> 14:25:02,050
        Eric Wulff: quickly enough in production, anyway.

        820
        14:25:02,590 --> 14:25:03,639
        Eric Wulff: Um,

        821
        14:25:06,150 --> 14:25:08,350
        Eric Wulff: but it sort of shows that the

        822
        14:25:08,440 --> 14:25:17,159
        Eric Wulff: there is enough information in the data to do better. We just uh need to improve the model or the the training of the model somehow.

        823
        14:25:20,160 --> 14:25:25,100
        Enrico Fermi Institute: Okay. Shigeki, you have your hand raised.

        824
        14:25:25,830 --> 14:25:42,530
        Shigeki: Yeah, I just have a question in terms of the amount of data you're going through, and the model size, which I guess is measured in terms of the number of parameters as well as hyperparameters. And whether or not there is a

        825
        14:25:42,540 --> 14:25:54,120
        Shigeki: size that physics problems in HEP tend to gravitate to, or can it be all over the map in terms of model size, data set size, and number of hyperparameters?

        826
        14:25:55,040 --> 14:25:56,179
        Eric Wulff: Um!

        827
        14:25:56,320 --> 14:26:00,129
        Eric Wulff: So, the number of hyperparameters,

        828
        14:26:00,190 --> 14:26:07,620
        Eric Wulff: that's a little bit arbitrary, depending on what you mean by hyperparameters.

        829
        14:26:08,040 --> 14:26:10,180
        Eric Wulff: If you count,

        830
        14:26:10,250 --> 14:26:11,389
        Eric Wulff: well,

        831
        14:26:11,430 --> 14:26:13,889
        Eric Wulff: you can configure

        832
        14:26:14,040 --> 14:26:23,330
        Eric Wulff: very many things with our model. So if you count all those hyperparameters, I don't know how many there are, but there are hundreds, and we don't tune all of them, because there are too many.

        833
        14:26:28,100 --> 14:26:33,720
        Eric Wulff: The number of trainable parameters in the model is around one million,

        834
        14:26:34,130 --> 14:26:37,850
        Eric Wulff: so that's fairly small, if you

        835
        14:26:37,890 --> 14:26:39,450
        Eric Wulff: compare with other

        836
        14:26:40,090 --> 14:26:46,880
        Eric Wulff: sciences, like image recognition or natural language processing. Then this is really a small model.

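        For reference, the "around one million" figure is the trainable-parameter count, which is one line to compute. A minimal sketch, assuming PyTorch; the toy network is illustrative, not the MLPF architecture:

        ```python
        import torch.nn as nn

        def count_trainable(model: nn.Module) -> int:
            # counts only the weights the optimizer actually updates; hyperparameters
            # (layer widths, depths, learning rates, ...) are not included here
            return sum(p.numel() for p in model.parameters() if p.requires_grad)

        toy = nn.Sequential(nn.Linear(256, 512), nn.ReLU(), nn.Linear(512, 64))
        print(count_trainable(toy))  # 164416 for this toy network
        ```
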
        837
        14:26:47,030 --> 14:26:48,389
        Eric Wulff: Um!

        838
        14:26:48,570 --> 14:26:50,480
        Eric Wulff: We think that,

        839
        14:26:50,580 --> 14:26:52,679
        Eric Wulff: actually, I don't know

        840
        14:26:53,190 --> 14:26:57,809
        Eric Wulff: the memory requirements that we would have to adhere

        841
        14:26:57,850 --> 14:27:05,289
        Eric Wulff: to if this would go into production at some point in the future. But I don't think we could go much larger,

        842
        14:27:05,410 --> 14:27:19,759
        Eric Wulff: at least not without doing some kind of quantization, quantization-aware training or post-training quantization, or perhaps pruning weights after training, or doing some other tricks like that.

        843
        14:27:19,990 --> 14:27:23,109
        Eric Wulff: Data set size, so the

        844
        14:27:23,680 --> 14:27:26,389
        Eric Wulff: the one we are currently using.

        845
        14:27:30,540 --> 14:27:34,559
        Eric Wulff: I think it's around four hundred thousand events

        846
        14:27:35,000 --> 14:27:38,260
        Eric Wulff: collision events of the different kinds.

        847
        14:27:40,140 --> 14:27:44,790
        Shigeki: Do you have an approximate idea of how many actual gigabytes that is?

        848
        14:27:45,140 --> 14:27:46,559
        Eric Wulff: Um

        849
        14:27:47,210 --> 14:27:48,730
        Shigeki: auto-

        850
        14:27:49,250 --> 14:27:51,920
        Eric Wulff: It's a few hundred gigabytes,

        851
        14:27:52,100 --> 14:27:54,480
        Eric Wulff: less than a thousand,

        852
        14:27:55,530 --> 14:28:08,920
        Shigeki: And presumably when you're running this, it's compute-bound, not I/O-bound, in terms of feeding the training data?

        853
        14:28:08,950 --> 14:28:11,229
        Shigeki: Or does it depend?

        854
        14:28:11,450 --> 14:28:18,439
        Eric Wulff: No, I would say it's compute-bound. Oh, you mean looking at the GPU utilization? It goes

        855
        14:28:18,590 --> 14:28:20,070
        Eric Wulff: close to one hundred percent.

        856
        14:28:20,139 --> 14:28:22,229
        Shigeki: Mhm Okay, thanks.

        857
        14:28:22,559 --> 14:28:27,009
        Enrico Fermi Institute: And do you know how much of the memory on the GPU you're using?

        858
        14:28:27,570 --> 14:28:30,279
        Eric Wulff: Yes, we

        859
        14:28:30,400 --> 14:28:33,209
        Eric Wulff: use all of it, basically.

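        Whether a job like this could live in a slice of a GPU comes down to its measured peak memory and utilization. A minimal sketch, assuming PyTorch on a CUDA device; the training-step placeholder is illustrative:

        ```python
        import torch

        torch.cuda.reset_peak_memory_stats()
        # ... run one representative training step here ...
        peak_gib = torch.cuda.max_memory_allocated() / 2**30
        total_gib = torch.cuda.get_device_properties(0).total_memory / 2**30
        print(f"peak {peak_gib:.1f} GiB of {total_gib:.1f} GiB")
        # if the peak is near the card's total, as reported here, a fraction of
        # an 80 GB A100 would not fit the job; GPU utilization near 100% says
        # the same thing about compute
        ```
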
        860
        14:28:34,049 --> 14:28:40,529
        Enrico Fermi Institute: So then it would not help you to have centers that chop up these big GPUs?

        861
        14:28:41,969 --> 14:28:45,769
        Eric Wulff: I don't think so. So there is a problem

        862
        14:28:45,930 --> 14:28:57,160
        Eric Wulff: with having too large batch sizes, sometimes. Basically, in order to fill up the GPU, you increase the batch size that you use for training,

        863
        14:28:57,230 --> 14:28:58,449
        Eric Wulff: Um,

        864
        14:28:59,530 --> 14:29:05,829
        Eric Wulff: and that means you can push more data

        865
        14:29:05,850 --> 14:29:14,719
        Eric Wulff: through per time unit, but it doesn't necessarily mean you can do more optimization steps. So you might not

        866
        14:29:14,879 --> 14:29:17,020
        Eric Wulff: uh reach

        867
        14:29:17,160 --> 14:29:20,090
        Eric Wulff: the same accuracy quicker.

        868
        14:29:26,629 --> 14:29:38,190
        Eric Wulff: It's not obvious, or not always the case, that you can just throw more memory at it and it helps. Yeah, I was actually thinking of it the other way:

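        The trade-off being described reduces to simple arithmetic: for a fixed dataset, the number of optimizer updates per epoch falls as the batch size grows, so filling a larger GPU raises events per second without necessarily shortening time-to-accuracy. An illustration with round numbers, not the actual training configuration:

        ```python
        n_events = 400_000  # dataset size quoted earlier in the session
        for batch_size in (64, 256, 1024):
            updates_per_epoch = n_events // batch_size
            print(f"batch {batch_size:4d} -> {updates_per_epoch:4d} updates/epoch")
        # batch   64 -> 6250 updates/epoch
        # batch  256 -> 1562 updates/epoch
        # batch 1024 ->  390 updates/epoch: more throughput, but fewer steps
        ```
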
        869
        14:29:38,990 --> 14:29:45,470
        Enrico Fermi Institute: we have a question in our data center of how much we should chop up the A100s using MIG,

        870
        14:29:47,480 --> 14:29:50,440
        Enrico Fermi Institute: you know, give a person a whole

        871
        14:29:51,010 --> 14:29:54,830
        Enrico Fermi Institute: eighty gigs, or split it up two ways or four ways

        872
        14:29:55,139 --> 14:30:03,550
        Eric Wulff: to several users at the same time.

        873
        14:30:05,549 --> 14:30:06,580
        Enrico Fermi Institute: Thanks.

        874
        14:30:07,530 --> 14:30:09,519
        Enrico Fermi Institute: So, another comment?

        875
        14:30:12,860 --> 14:30:17,950
        Enrico Fermi Institute: Sorry I got to the

        876
        14:30:18,650 --> 14:30:27,329
        Dirk: Yeah, I had a question, and it's not so much, I mean, Eric, if you know you can answer, but it's more looking at the

        877
        14:30:27,559 --> 14:30:38,899
        Dirk: broader impact of that, and the follow-on, because this is interesting, and this is R&D. But what's the next step? Have there been any discussions how

        878
        14:30:38,969 --> 14:30:41,610
        Dirk: to integrate this?

        879
        14:30:41,700 --> 14:30:58,269
        Dirk: Eventually, you said, it's improving particle flow, so eventually it should feed back into how we run the reconstruction, basically. And then the question comes: how would you actually deploy this? How often do you have to run it?

        880
        14:30:58,540 --> 14:31:19,770
        Dirk: How long does it take? And how often do I have to renew it, basically, with new data, to check that the parameters are still okay? And it's not just a question about this specific thing. So these are the larger questions. Maybe Lindsay, or, I don't know, Mike might answer, if there have been any

        881
        14:31:19,780 --> 14:31:26,789
        Dirk: discussions of that already, or if that's still to come after the initial R&D is done.

        882
        14:31:30,130 --> 14:31:33,150
        Eric Wulff: Well, I would say, if

        883
        14:31:33,470 --> 14:31:36,980
        Eric Wulff: if we are able to prove, or

        884
        14:31:37,030 --> 14:31:38,920
        Eric Wulff: somehow show, that

        885
        14:31:39,020 --> 14:31:43,090
        Eric Wulff: this machine learned approach to particle flow works

        886
        14:31:43,170 --> 14:31:44,490
        Eric Wulff: uh

        887
        14:31:44,880 --> 14:31:52,579
        Eric Wulff: as well, but more efficiently, or even better, than the

        888
        14:31:52,610 --> 14:31:54,660
        Eric Wulff: methods that are used at the moment,

        889
        14:31:55,670 --> 14:31:59,449
        Eric Wulff: then we sort of freeze that model and

        890
        14:31:59,690 --> 14:32:04,779
        Eric Wulff: get it into production, and then we shouldn't need to redo any hyperparameter

        891
        14:32:04,820 --> 14:32:34,339
        Dirk: optimization or anything like that. Then it's like having a finished algorithm. Yeah, but with data taking, the detector changes all the time. So who knows if the training you did on two thousand and twenty-two data, or even Run 2 data, is still valid for your next set of data. Right, but we're not training on data, we're training on simulation. Okay, right. But I think, when we talk about these kinds of problems, one of the things that needs to be studied

        892
        14:32:34,580 --> 14:32:44,590
        Ian Fisk: is how stable these are, and whether they really transfer, because it could be that we're incredibly lucky, and once you do the hyperparameter optimization it's applicable to

        893
        14:32:45,180 --> 14:32:51,009
        Ian Fisk: small changes in data. And one thing that I think we can see from Eric's plots is that

        894
        14:32:51,050 --> 14:33:01,189
        Ian Fisk: it makes these things faster. They train faster and better after the optimization. And so if we were unreasonably lucky, they'll actually save us resources.

        895
        14:33:02,360 --> 14:33:03,300
        Okay,

        896
        14:33:03,500 --> 14:33:08,860
        Dirk: Okay. But it sounds like it's a discussion that's still to come. We're not quite there yet.

        897
        14:33:09,400 --> 14:33:25,109
        Ian Fisk: Well, I think so. I think the thing is, given how much this improves the situation, chances are (and I think this applies to multiple science fields, not just ourselves) that we should be factoring these things into our discussion about how we're going to use HPC

        898
        14:33:25,140 --> 14:33:35,829
        Ian Fisk: for the report. And then we'll have to wait and see whether this is a workflow that we're constantly running, or one that we are running once in a while.

        899
        14:33:39,190 --> 14:33:47,179
        Mike Hildreth: Yeah, I guess I would agree with that. We don't have enough data to know

        900
        14:33:47,840 --> 14:33:53,670
        Mike Hildreth: how often we're going to have to retrain these. But this use case is certainly in the planning.

        901
        14:33:54,080 --> 14:33:55,760
        Enrico Fermi Institute: Is it right?

        902
        14:33:55,850 --> 14:34:07,809
        Enrico Fermi Institute: I think the one remaining worry is, we haven't been through a complete recalibration cycle of the detector after a stop or anything like that,

        903
        14:34:07,820 --> 14:34:21,400
        Enrico Fermi Institute: to see how robust a single training, or the most optimal training, is with respect to the changing parameters of the detector. It's just something we have to find out. But it's not going to change the picture all that much, to be honest.

        904
        14:34:21,410 --> 14:34:28,360
        Enrico Fermi Institute: But yeah, I agree with Ian here. This is probably going to save us resources as well in the long run.

        905
        14:34:28,620 --> 14:34:30,320
        Dirk: Okay, thanks.

        906
        14:34:30,510 --> 14:34:38,550
        Dirk: That makes it difficult for us to write up, because we can write the use case in, but it's extremely hard to attach any numbers to it at the moment.

        907
        14:34:41,470 --> 14:34:55,099
        Enrico Fermi Institute: Yeah, I guess another way to summarize it: we've shown that this works, and that we can get really great results out of it, but we haven't understood the true steady-state operational parameters of this system.

        908
        14:34:59,230 --> 14:35:04,370
        Eric Wulff: And just to be clear, there still needs to be

        909
        14:35:04,610 --> 14:35:08,699
        Eric Wulff: quite a bit of work before this would be ready to go into production.

        910
        14:35:09,140 --> 14:35:10,600
        Eric Wulff: It's still

        911
        14:35:10,880 --> 14:35:14,050
        Eric Wulff: like, we don't understand

        912
        14:35:14,200 --> 14:35:18,509
        Eric Wulff: all the properties of how it reconstructs particles well enough yet,

        913
        14:35:20,650 --> 14:35:23,980
        Eric Wulff: although it's looking good, it's looking promising,

        914
        14:35:24,230 --> 14:35:30,350
        Eric Wulff: but it needs to be validated much more before production.

        915
        14:35:41,060 --> 14:35:44,129
        Enrico Fermi Institute: So, do we have more questions for Eric?

        916
        14:35:46,660 --> 14:35:50,649
        Enrico Fermi Institute: I guess one silly question

        917
        14:35:51,140 --> 14:36:03,900
        Enrico Fermi Institute: in terms of actually trying to use this like in Cmssw. And this is mostly because I don't remember the last time that Joseph presented this, How fast does this go per event in inference mode?

        918
        14:36:04,220 --> 14:36:06,810
        Enrico Fermi Institute: What does the throughput look like?

        919
        14:36:06,940 --> 14:36:24,380
        Eric Wulff: I don't think we have done anything there that would be comparable to production. Or maybe an even better question is, what does the memory footprint look like on GPU or CPU?

        920
        14:36:24,770 --> 14:36:31,000
        Eric Wulff: I don't know that off the top of my head, but I know we have a plot somewhere that I can

        921
        14:36:31,100 --> 14:36:32,899
        Enrico Fermi Institute: All good. Thank you.

        922
        14:36:37,540 --> 14:36:46,069
        Enrico Fermi Institute: Okay, there are no other questions, so we can move on. Thank you very much for the presentation, Eric.

        923
        14:36:46,360 --> 14:36:48,119
        Eric Wulff: No problem. Thanks for listening.

      • 13:20
        Impacts of Expanded HPC/Cloud Use 20m

        [Eastern time]

         

        929
        14:37:45,550 --> 14:37:49,740
        Dirk: So, figure out how to use this. So we were at impacts, right?

        930
        14:37:49,770 --> 14:37:50,810
        Enrico Fermi Institute: Yeah,

        931
        14:37:50,880 --> 14:37:56,380
        Dirk: I can say a little bit about that. And I think we discussed some of that yesterday already,

        932
        14:37:56,630 --> 14:38:01,189
        Dirk: but it's also including cloud now, so we're looking at both.

        933
        14:38:01,450 --> 14:38:19,450
        Dirk: So what happens if we actually start using a lot of HPC and cloud? At the moment we run them opportunistically, so they are considered an add-on. But if we ever get to a point where they're a large fraction of our overall resources,

        934
        14:38:19,660 --> 14:38:37,420
        Dirk: what's the impact on our global computing infrastructure? And how does it impact the owned resources that are still in the mix? You basically would have a lot of compute external to our own resources in some way,

        935
        14:38:37,470 --> 14:38:43,410
        Dirk: and then you look at what that means for our own sites. What kind of changes

        936
        14:38:44,440 --> 14:39:02,069
        Dirk: might potentially be needed there to facilitate large-scale cloud and HPC use? To a large degree that will depend on how much we are actually using storage at the cloud or the HPC. If you consider that you don't have any storage there, and you have to stream, or in some other way get the data

        937
        14:39:02,110 --> 14:39:09,129
        Dirk: in and out quickly and just process it on demand, that puts more pressure on our own sites, versus,

        938
        14:39:09,150 --> 14:39:22,100
        Dirk: if you look at ATLAS, they have a self-contained site. That follows more the model of just bringing up another site somewhere else on some external resources, but it's mostly self-contained.

        939
        14:39:22,530 --> 14:39:23,539
        Dirk: Um!

        940
        14:39:23,700 --> 14:39:42,489
        Dirk: The other impact is that if we decide tomorrow, for instance, that our code performs great on ARM, and we should switch to it as much as possible because it's more cost-effective, you can actually do that much quicker on the cloud. For instance, on that Google site, in principle,

        941
        14:39:43,130 --> 14:39:59,530
        Dirk: ATLAS could decide tomorrow that, from now on, we're provisioning ARM CPU and no Intel CPU anymore, because you just change the instance type. You can't do that on our own resources; that's a much longer process of multiple years to swap out resources.

        942
        14:39:59,560 --> 14:40:01,730
        Dirk: And uh, yeah,

        943
        14:40:01,830 --> 14:40:21,730
        Dirk: And the other obvious issue is, even if we get storage at the cloud or HPC sites, you have to worry about transfers, because all these resources need to be integrated into our transfer infrastructure. We need to have Rucio be able to connect somehow, maybe

        944
        14:40:22,480 --> 14:40:23,449
        Dirk: have

        945
        14:40:24,520 --> 14:40:35,929
        Dirk: intermediary node services. I know BNL has some Globus Online endpoint that ATLAS uses to facilitate transfers to some HPCs, and things like that, so

        946
        14:40:36,540 --> 14:40:52,250
        Dirk: that feeds directly into the last point, network integration. So it's not just the transfer services, but also the underlying transfer fabric, the network connectivity of the cloud and HPC resources.

        947
        14:40:57,530 --> 14:41:07,960
        Dirk: As I said, we discussed some of it yesterday already, and the one comment was that we should break out hardware and service costs.

        948
        14:41:08,930 --> 14:41:11,800
        Dirk: So, anything else? Any other comments on this?

        949
        14:41:17,770 --> 14:41:20,830
        Enrico Fermi Institute: One of the things that we had talked about in our

        950
        14:41:21,430 --> 14:41:33,769
        Enrico Fermi Institute: discussions among the blueprint group, before the workshop here, was: is there any impact

        951
        14:41:34,040 --> 14:41:46,960
        Enrico Fermi Institute: on grid sites if we were to shift large amounts of certain kinds of workflows to cloud or HPC? Like, if we did a lot

        952
        14:41:46,980 --> 14:41:53,320
        Enrico Fermi Institute: more simulation on HPC, would we have to,

        953
        14:41:53,540 --> 14:42:01,009
        Enrico Fermi Institute: with the Tier-2s, run correspondingly more analysis or something like that? If that were the case, would they have to

        954
        14:42:01,330 --> 14:42:04,189
        Enrico Fermi Institute: beef up their facilities in certain ways,

        955
        14:42:05,200 --> 14:42:12,550
        Enrico Fermi Institute: or does that not make sense at all? Should we just anticipate that we'll be able to run all workload types on all resources,

        956
        14:42:13,990 --> 14:42:15,060
        things like that?

        957
        14:42:16,390 --> 14:42:19,609
        Enrico Fermi Institute: I see there's a hand raised from Eric.

        958
        14:42:34,640 --> 14:42:37,089
        Eric Lancon: to export um

        959
        14:42:37,220 --> 14:42:39,349
        Eric Lancon: the Cpu processing

        960
        14:43:17,160 --> 14:43:18,999
        Eric Lancon: at the same site.

        961
        14:43:23,900 --> 14:43:40,379
        Dirk: Yeah, that's something we worried about, the impact on the data transfers for Fermilab specifically. Because if you look at how we designed HEPCloud, we basically treat the HPC as an external compute resource, and then most of

        962
        14:43:40,540 --> 14:43:53,389
        Dirk: the I/O and the data actually goes through Fermilab. So far everything is holding up nicely, but eventually, as we scale up HPC use, there's probably going to be an impact

        963
        14:43:53,480 --> 14:43:58,259
        Dirk: on provisioning of network and storage at Fermilab.

        964
        14:44:22,190 --> 14:44:23,250
        Um.

        965
        14:44:23,340 --> 14:44:25,430
        Enrico Fermi Institute: Other comments on

        966
        14:44:25,660 --> 14:44:30,349
        Enrico Fermi Institute: impacts of HPC and cloud use on the existing infrastructure?

        967
        14:44:37,250 --> 14:44:39,300
        Steven Timm: I'll just say something I heard

        968
        14:44:39,420 --> 14:44:42,249
        Steven Timm: that you might not think about.

        969
        14:44:42,480 --> 14:44:43,560
        Steven Timm: Uh.

        970
        14:44:43,730 --> 14:44:48,970
        Steven Timm: This was not a CMS test, this was DUNE, but we were running a very

        971
        14:44:49,080 --> 14:44:58,119
        Steven Timm: heavy load, calling out to a Google Cloud inference server, and we managed to saturate the network, we believe, for a short time between us and Google.

        972
        14:44:59,330 --> 14:45:02,110
        Steven Timm: So uh, you can.

        973
        14:45:02,280 --> 14:45:06,529
        Steven Timm: If you're doing inference, you have to be careful of your network.

        974
        14:45:17,080 --> 14:45:29,849
        Enrico Fermi Institute: I have what is possibly a profoundly uninformed question: how much of our Monte Carlo generation, at the actual generator level, is being

        975
        14:45:29,860 --> 14:45:38,420
        Enrico Fermi Institute: done, or taking place, on GPUs? Like, using GPUs to do the Monte Carlo integration and unweighting,

        976
        14:45:40,460 --> 14:45:56,779
        Enrico Fermi Institute: because that is a significant fraction of the time that we spend right now. I mean, is that, in ATLAS and CMS, zero? Because a very quick search on the Internet informs us that

        977
        14:45:56,790 --> 14:46:15,759
        Enrico Fermi Institute: GPU Monte Carlo integration has been around for more than ten years now, and that the speed-up for that integration is like a factor of fifty or something. Though, of course, this probably depends on the shape of the thing that you're integrating, and how many poles it has and whatnot.

        978
        14:46:15,870 --> 14:46:24,899
        Enrico Fermi Institute: But has anyone looked at benchmarking that? And could it have a major impact if we could significantly reduce

        979
        14:46:25,020 --> 14:46:28,389
        Enrico Fermi Institute: the

        980
        14:46:28,420 --> 14:46:37,380
        Enrico Fermi Institute: time to getting an integrated cross-section, and then also the time to unweighting the necessary amounts of events?

        981
        14:46:37,460 --> 14:46:48,019
        Enrico Fermi Institute: And could that fit on the HPC resources better? Could we use that in any way? I'm not sure; after that, this goes really open-ended. But it seems like it's something we're not considering,

        982
        14:46:48,150 --> 14:46:54,739
        Enrico Fermi Institute: because it would be a really nice way to hide a lot of the latency in our production workloads right now,

        983
        14:46:55,120 --> 14:46:57,210
        Enrico Fermi Institute: or get rid of it, not even hide it.

        984
        14:47:00,660 --> 14:47:13,370
        Enrico Fermi Institute: Yeah, this was a really open-ended question. But have we looked at that? And if we're not doing it now, after ten years, there must be something wrong.

        985
        14:47:13,420 --> 14:47:20,730
        Dirk: Maybe Lindsay, but you and Mike should be in the best position to be able to answer that question.

        986
        14:47:21,190 --> 14:47:25,329
        Enrico Fermi Institute: For something that is that old,

        987
        14:47:25,360 --> 14:47:31,520
        Enrico Fermi Institute: there's either something wrong with it, or we've actually just not been paying attention to it for a decade.

        988
        14:47:31,530 --> 14:47:46,870
        Enrico Fermi Institute: Yeah, I personally don't have any information on that. Mike, do you have anything? I think the answer is zero as well, you know. So why aren't we using this? That's kind of a weird one.

        989
        14:47:47,170 --> 14:47:49,719
        Steven Timm: There have been studies recently showing that almost

        990
        14:47:49,890 --> 14:48:01,270
        Steven Timm: the dominant part of generation is actually throwing the dice, rolling random numbers. I don't know if that's true for CMS, but I know it's true for DUNE. I mean, could you envision a situation where you're

        991
        14:48:01,280 --> 14:48:16,659
        Enrico Fermi Institute: using it to generate random numbers for you and nothing else? Yeah, I mean, that's probably what a large portion of it is, that they're throwing lots of random numbers in parallel. They have very good RNGs for GPUs.

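        The reason this maps so well onto GPUs is that plain Monte Carlo integration is just drawing random points and averaging the integrand, which is embarrassingly parallel. A textbook sketch in NumPy, not anything from a production generator; the factor-of-fifty claims in the literature come from running this same pattern on a GPU array library:

        ```python
        import numpy as np

        def mc_integrate(f, lo, hi, n=10_000_000, seed=1):
            """Estimate the integral of f over [lo, hi] by uniform sampling."""
            rng = np.random.default_rng(seed)
            x = rng.uniform(lo, hi, size=n)  # the "throwing random numbers" step
            return (hi - lo) * f(x).mean()

        # sanity check: the integral of sin(x) over [0, pi] is exactly 2
        print(mc_integrate(np.sin, 0.0, np.pi))  # ~2.000, up to statistical error
        ```
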
        992
        14:48:17,070 --> 14:48:35,200
        Dirk: I think the question also goes a little bit out of scope, because we're not supposed to look into what's going on on the framework side and the software side. But from the conversation I had with Muddy about the effort spent in terms of GPU,

        993
        14:48:35,210 --> 14:48:39,789
        Dirk: I think the simple answer is, we looked at the full chain:

        994
        14:48:40,240 --> 14:48:51,369
        Dirk: GEN, SIM, DIGI, RECO, plus whatever miscellaneous comes after, and then they decided that generation is not the primary target of

        995
        14:48:51,730 --> 14:49:02,819
        Dirk: a porting effort, because it's not overall that important for us. It's less important than reconstruction and tracking; it's just not the lowest-hanging fruit,

        996
        14:49:03,210 --> 14:49:16,139
        Dirk: and the picture changes, of course, from generator to generator. But I think that's the simple answer: the effort focused on certain areas, and that's one of the ones that wasn't focused on.

        997
        14:49:16,150 --> 14:49:34,669
        Enrico Fermi Institute: Yeah, I can see that; that's a reasonable answer, I guess. Looking at kind of the shape of the compute facilities that we are getting from HPC, packaging up some huge job that you send out to an HPC, wait a long time, and get your answer back, it seems,

        998
        14:49:34,680 --> 14:49:46,849
        Enrico Fermi Institute: at least in terms of the geometry or the topology of the problem, that makes a lot more sense for the kind of resources we're talking about. But I understand that RECO is certainly a higher priority in terms of compute.

        999
        14:49:49,940 --> 14:49:53,110
        Enrico Fermi Institute: That's sort of where my thinking is heading, that's all.

        1000
        14:49:57,580 --> 14:49:59,379
        Enrico Fermi Institute: Steve, did you have another comment?

        1001
        14:50:00,440 --> 14:50:01,420
        Steven Timm: No

        1002
        14:50:09,860 --> 14:50:13,999
        Enrico Fermi Institute: Other comments here? Or should we move on to network integration?

        1003
        14:50:21,620 --> 14:50:24,489
        Enrico Fermi Institute: Okay, sounds like we should move on.

      • 13:40
        Network Integration HPC/Cloud 20m

        [Eastern time]

         

        Network / Site connectivity slides

         

        1006
        14:50:30,210 --> 14:50:51,389
        Enrico Fermi Institute: So, one of the things we wanted to talk about was just how our Tier-1s and Tier-2s are connected today, and sort of what the plans are for that in the future, some of the forward-looking stuff. And then we'll also have a presentation from Dale Carder of ESnet

        1007
        14:50:51,400 --> 14:50:59,409
        Enrico Fermi Institute: to give us some of his thoughts as well. Yeah, one of the questions that comes up here is,

        1008
        14:51:00,220 --> 14:51:08,870
        Enrico Fermi Institute: especially with the clouds, what can we do about connecting things to LHCONE, peering, all this business?

        1009
        14:51:08,990 --> 14:51:14,459
        Enrico Fermi Institute: People like to talk about egress costs. Is there any quick and easy thing we can do to reduce those?

        1010
        14:51:14,770 --> 14:51:15,980
        Enrico Fermi Institute: um.

        1011
        14:51:16,900 --> 14:51:19,919
        Enrico Fermi Institute: So, for site connectivity:

        1012
        14:51:19,990 --> 14:51:32,529
        Enrico Fermi Institute: for CMS, one hundred gigabit at all the Tier-2 sites, and a terabit gateway to Fermilab. For the evolution of the US-based site connectivity, there are plans to demonstrate

        1013
        14:51:32,690 --> 14:51:46,870
        Enrico Fermi Institute: over one hundred gigabit transfers in two thousand and twenty-three, and tentative plans to have Tier-2s at four hundred gigabits in two thousand and twenty-five. Fermilab has plans for upgrades, but they're taking sort of a year-by-year approach. I don't know, Dirk, if you want to add anything else to that.

        1014
        14:51:48,140 --> 14:51:59,569
        Dirk: No, that's basically it. I mean, all these plans are kind of tentative. We know we have to upgrade to get to HL-LHC, and it's going to be a process. But the exact schedule is a bit

        1015
        14:51:59,660 --> 14:52:02,299
        Dirk: undefined at the moment.

        1016
        14:52:02,330 --> 14:52:05,130
        Enrico Fermi Institute: And I should say that a lot of these

        1017
        14:52:05,250 --> 14:52:07,310
        Enrico Fermi Institute: plans were

        1018
        14:52:07,500 --> 14:52:09,719
        Enrico Fermi Institute: I didn't say so, but they are;

        1019
        14:52:10,130 --> 14:52:29,889
        Enrico Fermi Institute: the plans were developed before the slip of the LHC schedule, the agency schedule. So we're already talking about maybe pushing the demonstration of greater-than-one-hundred-gigabit transfers to twenty-four. Now that we have a couple more years, we're probably going to shift things back a bit.

        1020
        14:52:32,770 --> 14:52:45,550
        Enrico Fermi Institute: On the ATLAS side: the slide says a few, but really most of the Tier-2s are basically at or near one hundred gigabits, some at more than

        1021
        14:52:45,560 --> 14:52:59,419
        Enrico Fermi Institute: a hundred, with two by one hundred, things like that. The Tier-1, as I understand, has at least four by one hundred gigabit. If I'm misrepresenting any of the sites, just jump in and correct me. And yeah,

        1022
        14:52:59,430 --> 14:53:15,340
        Enrico Fermi Institute: our expectation is that in the future we'll have multiple-hundred-gigabit connectivity. One or more sites may have four hundred gigabit links. I think a lot of it depends on the economics of when it's sensible to start buying four hundred.

        1023
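        For scale, a back-of-the-envelope transfer time at the line rates quoted above (round numbers; protocol overhead and competing traffic ignored):

        ```python
        def transfer_hours(dataset_tb: float, link_gbps: float) -> float:
            # 1 TB = 8000 Gb; divide by line rate, convert seconds to hours
            return dataset_tb * 8000 / link_gbps / 3600

        for link_gbps in (100, 400):
            print(f"{link_gbps} Gb/s: {transfer_hours(100, link_gbps):.1f} h per 100 TB")
        # 100 Gb/s: 2.2 h per 100 TB
        # 400 Gb/s: 0.6 h per 100 TB
        ```
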
        14:53:17,980 --> 14:53:24,780
        Enrico Fermi Institute: Yes, those are the plans. So I think now we can jump to Dale's presentation, if you're out there, Dale.

         

        ESNET presentation

        1024
        14:53:25,670 --> 14:53:31,499
        Enrico Fermi Institute: Yes, sounds good. Okay, great. I'm going to stop sharing here, and you can start your share.

        1025
        14:53:35,730 --> 14:53:49,210
        Dale Carder: All righty. Thanks for having me here today, and feel free to interrupt; I like this interactive approach a lot more than me just preaching. So I've got kind of an overview of

        1026
        14:53:49,220 --> 14:53:56,039
        Dale Carder: sort of the DOE networking perspective on HPC facilities and the Tier-1,

        1027
        14:53:56,070 --> 14:53:59,089
        Dale Carder: and then we'll get into some cloud stuff, and then

        1028
        14:53:59,430 --> 14:54:05,269
        Dale Carder: then I sort of trail off into where I have more questions than answers, which is, I guess, not surprising,

        1029
        14:54:05,460 --> 14:54:08,239
        Dale Carder: given where some of these conversations have been.

        1030
        14:54:08,740 --> 14:54:15,859
        Dale Carder: So the biggest thing I want to emphasize, with respect to not just where we are now but

        1031
        14:54:16,190 --> 14:54:35,530
        Dale Carder: sort of the timeline between now and the beginning of High-Luminosity LHC: we had this big process to build ESnet6, and some of the key components included building our physical network into each DOE national lab,

        1032
        14:54:35,670 --> 14:54:48,760
        Dale Carder: and that means our fiber extends in there, with our equipment collocated at the site, with routers that we run there, so we can offer essentially any ESnet service at any national lab at full scale.

        1033
        14:54:49,880 --> 14:55:08,229
        Dale Carder: Also, ESnet now owns the optical equipment, and basically the end-to-end connectivity, so it's extremely cost-effective to upgrade. It's not going out and procuring circuits from vendors, things along that line. We're doing all of our optical engineering in house

        1034
        14:55:08,240 --> 14:55:14,650
        Dale Carder: now, so we can go out and buy modems from any vendor off the shelf and put them on our network after we qualify them.

        1035
        14:55:14,820 --> 14:55:24,800
        Dale Carder: So it's a very different evolution model from the traditional backbone approach of buying circuits and linking things together hop by hop.

        1036
        14:55:25,670 --> 14:55:27,000
        Dale Carder: Um!

        1037
        14:55:27,070 --> 14:55:40,660
        Dale Carder: There was already a little bit shown of where we are at, connectivity-wise, for each of the LCFs and NERSC. Basically, we're right now at this precipice of going from

        1038
        14:55:40,730 --> 14:55:59,839
        Dale Carder: N-by-one-hundred-gig connectivity to four-hundred-gig-class connectivity. Everyone's got a slightly different timeline, in large part due to equipment shortages and things of that sort, but generally, across the big DOE facilities, this is all kind of happening in parallel.

        1039
        14:55:59,880 --> 14:56:17,669
        Dale Carder: Yesterday there was a lot of talk about NERSC being sort of different than the LCFs, which is fair. They're targeting one terabit per second basically into their facility, and that's not through the lab; that's direct from ESnet to NERSC.

        1040
        14:56:18,780 --> 14:56:33,380
        Dale Carder: Where I think this puts us, or at least where I want us to be, is that the limiting factor is going to be at the site. If we can basically show up to the door of Fermilab, or show up to the door of

        1041
        14:56:33,470 --> 14:56:37,449
        Dale Carder: NERSC, wherever, with essentially all-you-can-eat connectivity,

        1042
        14:56:37,800 --> 14:56:41,620
        Dale Carder: it's now on to the border router, the security junk,

        1043
        14:56:41,990 --> 14:56:44,800
        Dale Carder: data transfer nodes and storage

        1044
        14:56:45,030 --> 14:56:49,839
        Dale Carder: where the scaling factors are going to be, not necessarily the wide area now.

        1045
        14:56:50,640 --> 14:56:56,309
        Dale Carder: So that's sort of where I think we're going to be, at least in the next

        1046
        14:56:56,520 --> 14:57:01,299
        Dale Carder: couple of years. We've got a long life cycle, especially on the optical network that we've built.

        1047
        14:57:02,380 --> 14:57:05,220
        Dale Carder: Are there any questions sort of on this front

        1048
        14:57:05,860 --> 14:57:08,820
        Dale Carder: before we drift off into cloud stuff?

        1049
        14:57:10,580 --> 14:57:15,270
        Enrico Fermi Institute: So when is the four hundred Gigabit stuff expected to become

        1050
        14:57:20,030 --> 14:57:37,240
        Dale Carder: Economical? It's a funny term, but it's almost more about availability right now: can you buy equipment or not? In some cases you can actually only buy the newer equipment, because it's the smaller fab sizes that are actually being produced,

        1051
        14:57:37,250 --> 14:57:51,110
        Dale Carder: versus the larger fabs, where you're competing with chips for dishwashers and things like that. So it's sort of this funny point. But in our conversations with,

        1052
        14:57:51,300 --> 14:58:06,020
        Dale Carder: I think we're up to like sixty or seventy of the Tier-2s, nearly everyone has a plan for the next couple of years; it's either like next year or right after that. So we're pretty much right at that point now.

        1053
        14:58:06,730 --> 14:58:24,369
        Dale Carder: A lot of that's driven by the economics of these major cloud data centers. So if you can buy equipment matching what the industry as a whole is buying, you're going to reap the rewards of that cost-effectiveness there.

        1054
        14:58:25,910 --> 14:58:41,179
        Enrico Fermi Institute: Is there a concern, and I know it doesn't apply for a lot of sites, but for things like firewalls and things like that? I know some sites are more concerned about that than others. Are the firewall appliances sort of keeping pace with the

        1055
        14:58:41,580 --> 14:58:54,109
        Dale Carder: I'll say no. I don't think there's truly been a demonstrated track record of that.

        1056
        14:58:54,420 --> 14:58:56,780
        Dale Carder: You know we still see

        1057
        14:58:57,000 --> 14:58:59,720
        Dale Carder: traffic compounding at, what,

        1058
        14:59:00,380 --> 14:59:03,320
        Dale Carder: forty-ish percent annually.

        1059
        14:59:03,380 --> 14:59:22,169
        Dale Carder: Those firewalls and middleboxes are typically designed for administrative workloads, where at the end of the day there's only so much data, where all you guys sitting on your laptops in the conference room are going to be competing for resources. That's very different from scientific computing.

        1060
        14:59:22,180 --> 14:59:27,799
        Dale Carder: So there are things ESnet has kind of worked on in that space, such as the Science DMZ model, for

        1061
        14:59:27,830 --> 14:59:33,430
        Dale Carder: how to place resources at a site, how to change the perimeter architecture to better accommodate

        1062
        14:59:33,540 --> 14:59:49,789
        Dale Carder: data-intensive sciences. So there are opportunities there. But I still don't see a world where you could cost-effectively deploy an off-the-shelf firewall middlebox.

        1063
        14:59:50,840 --> 14:59:51,830
        Okay,

        1064
        14:59:52,530 --> 15:00:11,389
        Dirk: I'd love to be proven wrong, so please do. Yeah, I had a comment on the last line on this slide, where you said wide disparity in HPC support for data-centric workflows. We discussed that a lot yesterday, and where I was

        1065
        15:00:11,400 --> 15:00:14,199
        Dirk: curious was whether this actually

        1066
        15:00:14,240 --> 15:00:17,059
        Dirk: has an impact on how the

        1067
        15:00:17,370 --> 15:00:33,970
        Dirk: these HPC facilities approach building up their external connectivity, or if that doesn't matter and they're still going for full connectivity to the data transfer nodes, at least, even if they don't, like NERSC, want to support

        1068
        15:00:41,960 --> 15:00:53,050
        Dale Carder: Right. So it's helpful for me to think about this in terms of procurement life cycles, because I think the LCFs are also very much in that world,

        1069
        15:00:53,060 --> 15:01:08,649
        Dale Carder: where you go out and you survey the user community for needs, come up with a list of use cases that you're going to support, and you go to DOE as part of CD-0 and say: here's the mission need of what we do. Then you go into alternatives analysis, and so on.

        1070
        15:01:08,970 --> 15:01:16,169
        Dale Carder: And then five years later a machine shows up on the dock and gets installed.

        1071
        15:01:16,180 --> 15:01:32,919
        Dale Carder: So it's really about being ahead of that, and ESnet is in the exact same boat. When we built ESnet6, that's exactly the process we went through, beginning five, six years ago, and here we are; we're going to have our official grand unveiling next month.

        1072
        15:01:33,890 --> 15:01:52,230
        Dale Carder: So on the ESnet side of the world, we've had these requirements reviews. Many of you here participated in the requirements review for HEP. We're currently doing one now for Basic Energy Sciences, and this goes directly into our longer-term procurement forecasts and budgets,

        1073
        15:01:52,240 --> 15:02:01,709
        Dale Carder: and things of that nature, so that we don't overbuild and spend a lot of taxpayer resources way too early,

        1074
        15:02:01,730 --> 15:02:07,559
        Dale Carder: nor get caught on the other end, far behind where the needs lie.

        1075
        15:02:07,650 --> 15:02:13,420
        Dale Carder: Essentially, we solve this on our end through just constant communication, and

        1076
        15:02:13,610 --> 15:02:21,599
        Dale Carder: beating people up, like Andrew Melo, for status as to what's going on, and making sure that we're in lockstep.

        1077
        15:02:25,500 --> 15:02:41,649
        Dale Carder: So for NERSC, like I said yesterday, we're doing a requirements review for Basic Energy Sciences, and in there will be a case study for LCLS-II, I think, and how that

        1078
        15:02:41,810 --> 15:02:47,979
        Dale Carder: operation at SLAC is going to be integrated with NERSC, because they're talking, again, terabit

        1079
        15:02:48,100 --> 15:02:53,969
        Dale Carder: workflows from the beamline to compute, and then autonomous steering back.

        1080
        15:02:54,240 --> 15:03:00,560
        Dale Carder: So there are things there that could be of relevance to this group, to see how other groups

        1081
        15:03:00,760 --> 15:03:02,240
        Dale Carder: are sort of handling it.

        1082
        15:03:07,670 --> 15:03:21,620
        Enrico Fermi Institute: All right. Do you have one more? I'll do one more, and if this is covered in another slide, feel free to defer. But what's the ESnet thinking on caching in the network?

        1083
        15:03:22,050 --> 15:03:31,530
        Dale Carder: Yeah, I'll have just a bullet on that. We can kind of open it up there as I get into it more.

        1084
        15:03:31,950 --> 15:03:36,629
        Dale Carder: Yeah, so let's talk about clouds. So,

        1085
        15:03:37,500 --> 15:03:40,180
        Dale Carder: the terminology around cloud stuff is

        1086
        15:03:41,210 --> 15:03:44,880
        Dale Carder: amazingly hard to comprehend, because every vendor has their own

        1087
        15:03:45,130 --> 15:03:54,259
        Dale Carder: proprietary language, and they'll use the same words, and none of them are actually descriptive of what's going on. But let's lump that into two bins:

        1088
        15:03:54,320 --> 15:03:56,470
        Dale Carder: public cloud and private cloud.

        1089
        15:03:56,820 --> 15:04:08,730
        Dale Carder: Public cloud is what happens when you just log into an EC2 console and fire up a VM; you're going to get a network that's essentially public-facing,

        1090
        15:04:08,790 --> 15:04:09,970
        Dale Carder: um,

        1091
        15:04:10,500 --> 15:04:17,009
        Dale Carder: and those egress charges we keep hearing about apply, and things of that nature.

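        To put a rough scale on those egress charges: published on-demand rates at the big providers have been of order ten cents per gigabyte (an assumed illustrative figure; real pricing is tiered and often negotiated):

        ```python
        def egress_cost_usd(dataset_tb: float, usd_per_gb: float = 0.09) -> float:
            # 0.09 is an assumed list-price rate; actual tiers and discounts vary
            return dataset_tb * 1000 * usd_per_gb

        print(f"${egress_cost_usd(100):,.0f} to move 100 TB out")  # ~ $9,000
        ```
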
        1092
        15:04:17,210 --> 15:04:18,940
        Dale Carder: Private cloud.

        1093
        15:04:19,090 --> 15:04:33,729
        Dale Carder: This is where you would be standing up multitudes of compute instances with some private back-end network, and then that private back-end network has some sort of egress

        1094
        15:04:34,030 --> 15:04:51,190
        Dale Carder: delivered through a multitude of means. But then it has to connect to something, right? It's fully self-contained, so you have to either connect back to your home institution or use some tunneling technology. Optionally, you can bring your own IP addressing.

        1095
        15:04:51,480 --> 15:05:10,629
        Dale Carder: The typical workloads are administrative computing. So, say the University of Chicago wanted to put the HR system in the cloud and keep it on the University of Chicago network, since it's HR data, this is the technology you would use. And, I should put this in bold:

        1096
        15:05:10,650 --> 15:05:18,269
        Dale Carder: it's very expensive. And we're talking about data rates commensurate with administrative computing, not research computing.

        1097
        15:05:18,660 --> 15:05:34,160
        Dale Carder: And that's why you see software routers, software appliances, doing these VPNs. So they've come up with, in addition to just multiple ways to extract money from you, different ways to work around these limitations. So,

        1098
        15:05:34,250 --> 15:05:43,919
        Dale Carder: if you're beyond the scale of what you can get away with using a software-based router and software-based VPNing of traffic back to an institution,

        1099
        15:05:44,200 --> 15:05:58,230
        Dale Carder: there are dedicated interconnects; these are essentially charged-by-the-hour connections. That's why I tried to put this like going to a restaurant: this is the four-dollar-sign menu option.

        1100
        15:05:58,400 --> 15:06:13,059
        Dale Carder: You have Cloud Exchange, which was sort of where you'd have this intermediate broker managing the physical infrastructure for you. We have some of these today on ESnet; we're working to deprecate them, because they're at the three-dollar-sign level,

        1101
        15:06:13,930 --> 15:06:24,750
        Dale Carder: and we're replacing them with this partner interconnection model, which is where you go out and you procure, and by you I mean ESnet goes out and procures, a middleman

        1102
        15:06:24,860 --> 15:06:33,130
        Dale Carder: to handle this sort of interconnection and get away from the hourly port charges to the various entities,

        1103
        15:06:33,290 --> 15:06:39,740
        Dale Carder: and throw some virtualization on top of that, and come out at only the two-dollar-sign approach.

        1104
        15:06:40,220 --> 15:06:42,260
        Dale Carder: But again, these are still

        1105
        15:06:42,400 --> 15:06:55,590
        Dale Carder: humble data rates. To put actual money on here, it's nearly impossible to figure out what these things cost; you need a used-car salesman to help you

        1106
        15:06:55,690 --> 15:07:09,500
        Dale Carder: figure this out. So, putting that into where we are today, connectivity-wise: in the public cloud realm, again, if you were to stand up

        1107
        15:07:09,640 --> 15:07:24,150
        Dale Carder: random sets of machines, this is sort of the connectivity we have, which is three hundred-gig connections to major markets for Google, six connections to Oracle, five to Amazon, five to Microsoft.

        1108
        15:07:24,160 --> 15:07:35,560
        Dale Carder: And these are basically there and ready to go, as was mentioned earlier on, like Fermilab being able to take advantage of the Google connectivity

        1109
        15:07:35,840 --> 15:07:42,449
        Dale Carder: on a couple of occasions now, most recently, I think, last October, when there was that inference training.

        1110
        15:07:43,390 --> 15:07:51,479
        Dale Carder: These are very, very cost-effective, such that we pay for these essentially out of the operating budget.

        1111
        15:07:51,860 --> 15:08:07,890
        Dale Carder: So this is just our cost of doing business, shared across all of DOE. It's not a big problem, because, much like we built into each of the national labs, we built ESnet6 into the major commercial facilities. So we're there.

        1112
        15:08:07,900 --> 15:08:15,040
        Dale Carder: So a lot of these connections are just a jumper across the building, that kind of thing, from our network to that network there. Go ahead.

        1113
        15:08:17,290 --> 15:08:23,250
        Dirk: But this basically doesn't give you a cost advantage; it just gives you capabilities, right?

        1114
        15:08:23,260 --> 15:08:46,970
        Dirk: Yep, exactly. But this, especially with Google, matches very well with their flat subscription model. Yeah, I mean, you still have the normal cost: if you go just on demand, you just pay normal egress fees; you just have a fast data connection there so that you can actually run your workflows. And then with the subscription, you get rid of egress, and you can, of course, use it fully. Okay, thanks.

        1115
        15:08:46,980 --> 15:08:54,459
        Dale Carder: Yeah, exactly. I think Oracle also may waive egress fees. I forget who is using that in DOE.

        1116
        15:08:57,180 --> 15:09:14,340
        Enrico Fermi Institute: So, quickly though: to take advantage of this, if I were to log on to EC2, and I've landed in, I guess, the right availability zone, I don't need to do anything special? If I'm moving data from somewhere in Amazon to somewhere connected to ESnet6,

        1117
        15:09:14,350 --> 15:09:18,790
        Enrico Fermi Institute: to me, the quote-unquote user, I don't have to do anything special?

        1118
        15:09:19,130 --> 15:09:32,770
        Dale Carder: Right. And this whole slide probably applies both to ESnet and to Internet2; I think we're probably nearly identical in capabilities in this regard, because it's just easy to scale up as usage

        1119
        15:09:32,950 --> 15:09:38,519
        Dale Carder: is in place. One thing I'll point out, though: in these direct connections to these peers,

        1120
        15:09:38,560 --> 15:09:42,939
        Dale Carder: there is human-to-human-level negotiation to get these into place.

        1121
        15:09:42,970 --> 15:09:48,749
        Dale Carder: So, for example, it took months to connect to Google. They said, well, how much are you going to use? And we're like, I don't know, all of it,

        1122
        15:09:49,050 --> 15:09:57,379
        Dale Carder: right? They were like, yeah, whatever. And then what do we do? We use all of their GPUs, for example, because we can.

        1123
        15:09:57,410 --> 15:10:10,549
        Dale Carder: These providers are much more used to diurnal traffic flows, like you would see with commercial users during the day and residential users at night. So,

        1124
        15:10:10,700 --> 15:10:15,559
        Dale Carder: to get these in place does require some negotiation and some long-range planning,

        1125
        15:10:16,170 --> 15:10:19,010
        Dale Carder: because we have to talk them into it and prove we're going to use it

        1126
        15:10:21,010 --> 15:10:23,160
        Dale Carder: Paolo, I see you've got your hand up.

        1127
        15:10:23,270 --> 15:10:38,420
        Paolo Calafiura (he): Yeah, it was kind of a question already asked, and then another one. So I believe that there was also some, uh, peering agreement with Oracle, right, if we use your boxes. But

        1128
        15:10:38,430 --> 15:11:05,940
        Paolo Calafiura (he): some discounts you guys set up? If I recall correctly, the Amazon one is something more like, if you use X amount of compute, some percentage of that can be egress. Yeah, yeah, exactly, something like that. And then, just out of curiosity: why do you have the most boxes to Oracle? Is that just because it happened and they were easy to deal with, or because there is a use for it long term?

        1129
        15:11:05,950 --> 15:11:25,680
        Dale Carder: Um, there's almost certainly someone going to use it; DOE is very, very big. So between the Office of Science and NNSA and all the other stuff going on, there's also, uh, you know, the DOE Federal network itself, which is now an overlay on ESnet,

        1130
        15:11:25,690 --> 15:11:30,750
        Dale Carder: sort of. That's the nice thing about it: once you hit that scale, we can kind of share the economics of this.

        1131
        15:11:32,400 --> 15:11:52,140
        Dale Carder: And then quickly I'll go over the private cloud interconnects. So this is where we have, we're putting into place actually, as we speak, a terabit of connectivity to a third party called Packet Fabric, and then they go through and punch physical connectivity into each of the vendors

        1132
        15:11:52,150 --> 15:12:07,749
        Dale Carder: for the private cloud hosting. That will replace things like we had previously with the Cloud Exchange product. So again, that's a bit more, um, you know, targeting administrative workloads. But

        1133
        15:12:07,790 --> 15:12:13,040
        Dale Carder: as we get into talking about LHCONE, maybe it's a model that could be used there, too. I don't know.

        1134
        15:12:13,120 --> 15:12:14,430
        Dale Carder: uh Dirk,

        1135
        15:12:17,590 --> 15:12:21,010
        Dirk: I think Fernando was first. If he wants to go.

        1136
        15:12:24,700 --> 15:12:31,579
        Dirk: No, now he has lowered his hand. I just had a quick question. So sorry, sorry, I was on mute,

        1137
        15:12:31,590 --> 15:12:48,210
        Fernando Harald Barreiro Megino: and I still didn't get over the public cloud section. So do I need to be on Google in the availability zone or region Seattle, Chicago, or NYC in order for

        1138
        15:12:48,450 --> 15:12:54,120
        Fernando Harald Barreiro Megino: my transfers to go through ESnet?

        1139
        15:12:54,640 --> 15:13:11,879
        Dale Carder: Every vendor is different. With Google, I think they announced, or they will haul traffic regardless of where it ingresses or egresses their network. Amazon is the exact opposite, where you have to send the traffic to the exact zone. So, every one of them,

        1140
        15:13:11,890 --> 15:13:18,989
        Dale Carder: all these systems are proprietary in that regard, and you unfortunately kind of have to know in advance what you're walking into.

        1141
        15:13:24,480 --> 15:13:31,589
        Fernando Harald Barreiro Megino: And if there is a transfer to, I mean, SWT2 or anywhere at some university in

        1142
        15:13:31,630 --> 15:13:34,459
        Fernando Harald Barreiro Megino: the US, that will

        1143
        15:13:35,020 --> 15:13:40,880
        Fernando Harald Barreiro Megino: go through the, I mean, through the normal Internet and will not end up in ESnet, right?

        1144
        15:13:41,730 --> 15:13:51,500
        Dale Carder: Right. So for Google, that'd be the case. For Amazon, where ESnet does not peer with Amazon in Europe, we would probably never see the traffic until it shows up through whatever

        1145
        15:13:51,750 --> 15:13:53,490
        Dale Carder: other path exists.

        1146
        15:13:53,590 --> 15:13:54,440
        Okay,

        1147
        15:13:54,860 --> 15:13:57,260
        Fernando Harald Barreiro Megino: Okay, thanks.

        1148
        15:13:58,290 --> 15:14:08,890
        Dirk: Okay. And I had a question. Yesterday, when we talked briefly about Lancium, I remember from talking with them that they said they had plans to peer with,

        1149
        15:14:08,900 --> 15:14:24,610
        Dirk: I think it was ESnet. Are you aware of anything? I mean, I think they're still building the data center, so I'm not sure at what stage they are with that. But our general, our peering policy, is relatively wide open, so long as we can justify it,

        1150
        15:14:24,720 --> 15:14:34,139
        Dale Carder: so any new market entrants, that should not be a barrier on the network side, as long as they show up at essentially any

        1151
        15:14:34,260 --> 15:14:44,270
        Dale Carder: major co-location facility where networks come and meet together. So, for example, we're in Houston, we're in Dallas, we're in El Paso, I mean, kind of in their neck of the woods.

        1152
        15:14:46,180 --> 15:14:48,409
        Dale Carder: So that question is very easy.

        1153
        15:14:53,330 --> 15:14:57,830
        Dale Carder: All right, all right. Um, This is sort of more like the

        1154
        15:14:58,240 --> 15:15:00,100
        Dale Carder: I dumped all the other stuff here.

        1155
        15:15:00,180 --> 15:15:07,480
        Dale Carder: Um, so some other things ESnet has that are sort of just worth having on your laundry list of things to know exist.

        1156
        15:15:07,620 --> 15:15:10,210
        Dale Carder: Um, one is, you know,

        1157
        15:15:10,530 --> 15:15:29,139
        Dale Carder: APIs and, you know, dynamic requesting of resources is something that ESnet has long since supported for layer 2 circuits, including bandwidth scheduling, on demand, and prioritization. That is how the LHCOPN

        1158
        15:15:29,150 --> 15:15:32,900
        Dale Carder: circuits are instantiated between the Tier 0 and the Tier 1s.

        1159
        15:15:33,160 --> 15:15:42,740
        Dale Carder: Um, also sort of in flight is dynamic layer 3 instantiation; it works internally to ESnet. We actually used it for LSST,

        1160
        15:15:42,840 --> 15:15:47,320
        Dale Carder: between SLAC and the South American networks.

        1161
        15:15:47,570 --> 15:16:07,369
        Dale Carder: Um, it's completely conceivable to open that up also, and that could be used as a way, if you wanted to dynamically, you know, acquire cloud resources, to hit the API endpoint and fire it up. So these things are very much near reality, should a use case justify their development.
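
        To make the "hit the API endpoint and fire it up" idea concrete, a hypothetical sketch of requesting a scheduled layer 2 circuit from a provisioning service. The URL, fields, and values are invented for illustration; they do not reflect the actual ESnet interface.

        ```python
        import requests

        # Hypothetical circuit request; endpoint and schema are invented.
        resp = requests.post(
            "https://provision.example.net/api/circuits",   # placeholder URL
            json={
                "a_end": "site-a-router",
                "z_end": "cloud-exchange-handoff",
                "bandwidth_mbps": 10_000,
                "start": "2022-11-01T00:00:00Z",   # on-demand scheduling window
                "end": "2022-11-02T00:00:00Z",
            },
            timeout=30,
        )
        resp.raise_for_status()
        print("circuit id:", resp.json()["id"])
        ```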

        1162
        15:16:07,500 --> 15:16:19,190
        Dale Carder: Um, there's an R&D project underway with Rucio on integration with our framework called SENSE, which is again more on, you know, dynamic network path provisioning

        1163
        15:16:19,610 --> 15:16:24,199
        Dale Carder: across the

        1164
        15:16:24,920 --> 15:16:26,660
        Dale Carder: what do you call it, and

        1165
        15:16:27,930 --> 15:16:40,060
        Dale Carder: potentially sort of kicking off now, where sort of the NERSC superfacility concept is sort of making, you know, the logical next step.

        1166
        15:16:40,620 --> 15:16:55,390
        Dale Carder: Internal to ESnet, we now have FPGA experience in house. So we have been working on some projects where we're using FPGAs to accelerate different,

        1167
        15:16:55,450 --> 15:17:11,650
        Dale Carder: the sort of use cases we've seen, from, like, triggers to compute, and dynamically load balancing in hardware. We're working on something like that for JLab. I think there is also a similar effort underway between

        1168
        15:17:11,760 --> 15:17:13,539
        Dale Carder: the ALS and NERSC.

        1169
        15:17:14,050 --> 15:17:18,290
        Dale Carder: um. In addition, those Fpgas can be used to, you know,

        1170
        15:17:18,420 --> 15:17:28,050
        Dale Carder: in my, like, crystal ball outlook: if we think about hitting, you know, the scaling limits of CPUs, it also probably means we'll end up hitting the scaling limits of TCP.

        1171
        15:17:28,150 --> 15:17:39,700
        Dale Carder: Um, someone smarter than me has probably already figured out when that happens. But we're sort of ready for that era, with the ability to have ESnet code running on FPGAs today.

        1172
        15:17:40,630 --> 15:17:48,639
        Dale Carder: On the more operational side of the house, we've got an R&D project underway on deployment of packet marking,

        1173
        15:17:48,650 --> 15:18:07,750
        Dale Carder: so using annotations in, like, the IPv6 packet header to identify what workload is running, and then reporting that back out, from, like, an accounting perspective, of, you know, what science domain and activity is on a particular link. That'll be pretty useful for

        1174
        15:18:07,760 --> 15:18:12,620
        Dale Carder: planning, you know, capacity planning, traffic engineering, those sorts of use cases.
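
        As a minimal illustration of the packet-marking idea, a sketch that stamps a flow at the sender by setting the IPv6 Traffic Class field via a socket option. The marking value and destination are assumptions, and the actual R&D effort may mark different header fields through a different mechanism.

        ```python
        import socket

        # Assumed marking value agreed with the network operator; not a real
        # registry entry.
        SCIENCE_DOMAIN_MARK = 0x40
        IPV6_TCLASS = getattr(socket, "IPV6_TCLASS", 67)  # 67 on Linux

        sock = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
        # Every packet of this flow now carries the mark in the IPv6 header,
        # so per-link accounting can attribute the traffic to a science domain.
        sock.setsockopt(socket.IPPROTO_IPV6, IPV6_TCLASS, SCIENCE_DOMAIN_MARK)
        sock.connect(("transfer.example.org", 1094))   # hypothetical DTN
        ```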

        1175
        15:18:13,490 --> 15:18:27,130
        Dale Carder: And then here's my catch-all for, uh, you know, XCache. So I think there's certainly a promising future of more integrated, uh, you know,

        1176
        15:18:27,230 --> 15:18:39,629
        Dale Carder: caching, or even bigger picture, storage on or in the network, to better use the resources available to us, with, for example, latency hiding being sort of an easier use case.

        1177
        15:18:39,640 --> 15:18:49,580
        Dale Carder: And then I think there are currently caches: there's one in California, I don't know the status, there might be one in Chicago, and also one planned for Boston.

        1178
        15:18:50,090 --> 15:19:01,929
        Dale Carder: But it seems to me, you know, my engineer's approach, as, like, a guy who doesn't make the decisions, is that it seems pretty straightforward and something we should continue to work on.
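
        The latency-hiding use case is essentially that jobs read through a nearby cache instead of the origin storage. A minimal sketch with the XRootD Python bindings, where the cache host and file path are placeholders:

        ```python
        from XRootD import client

        # Read via a (hypothetical) in-network cache; on a miss the cache
        # fetches from the origin and retains the file for later readers.
        CACHE = "root://xcache.example.net:1094/"
        PATH = "/store/data/example/file.root"   # placeholder path

        with client.File() as f:
            status, _ = f.open(CACHE + PATH)
            assert status.ok, status.message
            status, data = f.read(offset=0, size=1024)  # e.g. the file header
            assert status.ok, status.message
        print(f"read {len(data)} bytes through the cache")
        ```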

        1179
        15:19:02,990 --> 15:19:21,869
        Enrico Fermi Institute: Yeah, go ahead. No, I was gonna say, I'm sorry, I thought you were gonna go on to the next slide. I was going to say, I'm just rambling. Um, could you talk a bit more about this, the layer 3 VPN instantiation? So who would this be spanning,

        1180
        15:19:21,890 --> 15:19:41,860
        Dale Carder: their VPN from whom to whom, I guess? You know, I mean, how I build things, it supports anywhere to anywhere. So it's nebulous, because it's, you know, a generic framework. So the idea being, you've got site A and site Z, they wanna, and site F.

        1181
        15:19:42,270 --> 15:19:51,650
        Dale Carder: They could create a private network overlay for that activity. Um, traditionally, that was something very hard to do. You'd have to go around and, you know, signal up circuits, or

        1182
        15:19:51,700 --> 15:19:59,089
        Dale Carder: do all this work. Now it would look much more like, hey, here's a VLAN, you can actually route into it, and it will just get to the other side,

        1183
        15:19:59,120 --> 15:20:09,500
        Dale Carder: and it's completely private. It's the same technology that cloud providers are using on the back end for their virtual private networks. So, you know, it's the same kind of thing.

        1184
        15:20:09,900 --> 15:20:17,080
        Enrico Fermi Institute: So basically I could hit some API on your side, and

        1185
        15:20:17,090 --> 15:20:42,269
        Enrico Fermi Institute: you would say, okay, you connect to VLAN number five hundred and twenty-three, and the other one connects to, I don't know, six hundred and seventy-two, and the VLANs are, you know, tunneled together? You handle stitching together the layer 2 service, or whatever it takes to get from point to point? Yeah, or even layer 3 circuits, so you've got full resiliency within the continental US, and that kind of thing.

        1186
        15:20:42,290 --> 15:20:45,820
        Dale Carder: So yeah, it's pretty promising. Uh,

        1187
        15:20:45,940 --> 15:21:03,750
        Dale Carder: I think we just need more exploration of what the use cases are there. Like, we built it for ourselves, but there's nothing preventing that. And sort of how it was designed, the setup of one of these circuits takes longer to fill out the form in our database than anything else,

        1188
        15:21:03,760 --> 15:21:21,099
        Enrico Fermi Institute: Gotcha. And does this, you said, anyone to anyone? So I could potentially set this up, you know, at Vanderbilt and have the other end be a cloud provider. Yeah, that's what I'm thinking could be a popular use case,

        1189
        15:21:21,110 --> 15:21:25,710
        Dale Carder: right? And maybe you even want to have a second cloud provider. I mean, that's totally doable.

        1190
        15:21:25,820 --> 15:21:32,929
        Enrico Fermi Institute: Okay. Yeah, I can definitely think of a few interesting things you could do with that.

        1191
        15:21:33,340 --> 15:21:38,839
        Dale Carder: It's something that, again, it's sort of like, let's plant the seed of a, you know, capability that exists,

        1192
        15:21:39,060 --> 15:21:41,780
        Enrico Fermi Institute: and see if there's a a good use for it.

        1193
        15:21:42,680 --> 15:21:44,940
        Enrico Fermi Institute: Oh, thank you. Yeah,
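
        Putting the exchange above into code, a hypothetical sketch of what an anywhere-to-anywhere L3VPN request might look like: one VPN, two attachment points, each handed its own VLAN to route into. The endpoint and schema are invented for illustration only.

        ```python
        import requests

        API = "https://provision.example.net/api/l3vpn"   # placeholder URL

        # Create the private overlay, then attach both ends to it.
        vpn = requests.post(API, json={"name": "site-to-cloud"}, timeout=30).json()
        for port in ("vanderbilt-edge", "cloud-provider-handoff"):
            att = requests.post(
                f"{API}/{vpn['id']}/attachments",
                json={"port": port},   # each end gets a VLAN on its local port
                timeout=30,
            ).json()
            print(port, "-> route into VLAN", att["vlan"])
        ```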

        1194
        15:21:46,080 --> 15:21:52,379
        Dale Carder: Okay. And then here's where we drift off from the known to the less known.

        1195
        15:21:52,410 --> 15:22:02,250
        Dale Carder: So this is thinking about sort of these facilities as part of a greater ecosystem. We've covered the DOE space well, because, like, Bonnie has covered it.

        1196
        15:22:02,370 --> 15:22:16,099
        Dale Carder: Now, if you think about the NSF HPC sites in particular, it's even more disparate as to their connectivity and capabilities. So some sites, like, off the top of my head,

        1197
        15:22:16,860 --> 15:22:27,859
        Dale Carder: San Diego, um, are extremely well connected, like, I wouldn't worry about them, because typically they, like, own their infrastructure. Um,

        1198
        15:22:27,880 --> 15:22:33,060
        Dale Carder: NCSA is another one where oodles of network connectivity to it exist,

        1199
        15:22:33,270 --> 15:22:44,040
        Dale Carder: but then there's other centers, I'll, you know, unfortunately, I think, are in the scenario where, like, their machine is off in some business park outside of town, or

        1200
        15:22:44,290 --> 15:22:49,719
        Dale Carder: and there's not necessarily good connectivity for a data-centric workflow.

        1201
        15:22:49,820 --> 15:22:56,510
        Dale Carder: So if you're thinking about running more on NSF HPC facilities, you need to have a facilitation discussion with

        1202
        15:22:56,690 --> 15:23:01,789
        Dale Carder: the sites you're thinking about, to answer some key questions of: can you get your data in and out

        1203
        15:23:02,030 --> 15:23:04,070
        Dale Carder: uh in a production fashion?

        1204
        15:23:04,380 --> 15:23:08,250
        Dale Carder: Because there's a huge disparity between sites.

        1205
        15:23:09,720 --> 15:23:13,179
        Dale Carder: Now, on the university side, um,

        1206
        15:23:13,510 --> 15:23:24,780
        Dale Carder: we covered some of that just before I started. But of note, ESnet is talking to every single US Tier 2 site, basically, in preparation for high luminosity.

        1207
        15:23:25,070 --> 15:23:32,899
        Dale Carder: As such, we were sort of getting a good view as to where the universities are with their regional networks,

        1208
        15:23:33,370 --> 15:23:41,009
        Dale Carder: and in general, I think, with enough prior planning, which was our goal, the outlook continues to be good.

        1209
        15:23:41,080 --> 15:23:44,200
        Dale Carder: Um! But we need to keep that facilitation game up

        1210
        15:23:44,300 --> 15:23:56,660
        Dale Carder: and make sure that, you know, especially for universities that have one or two intermediate networks between them and ESnet or Internet2, everything upgrades in lockstep, or we can't connect these things together.

        1211
        15:23:57,680 --> 15:24:00,260
        Dale Carder: So the present looks

        1212
        15:24:00,450 --> 15:24:18,510
        Dale Carder: good. And the key to making this work, from my perspective, is the data challenges: a thing where we can point to and say, by this date it has to work as follows. The data challenges are going to be the forcing function that the community uses for internal justification, to,

        1213
        15:24:18,520 --> 15:24:30,059
        Dale Carder: you know, show their administration, the provost or whatever, like: hey, we do need this stuff, and here's when we need it by. And that program is vitally important.

        1214
        15:24:32,030 --> 15:24:38,670
        Dale Carder: Now, on to the perhaps more speculative stuff on my part, which is:

        1215
        15:24:38,700 --> 15:24:46,129
        Dale Carder: this community has a network called LHCONE, which is sort of a whole nother Internet connecting just the

        1216
        15:24:46,150 --> 15:24:49,229
        Dale Carder: resources together that exclusively

        1217
        15:24:49,350 --> 15:24:53,840
        Dale Carder: you know, work on these large-scale projects for the LHC.

        1218
        15:24:54,160 --> 15:25:08,620
        Dale Carder: So in the US you've got US CMS and US ATLAS sites, the Tier 1s and the Tier 2 centers, connected to LHCONE, and then ESnet has transatlantic connectivity where we connect to our peer networks in the EU,

        1219
        15:25:09,200 --> 15:25:13,050
        Dale Carder: again to the major Tier 1 and Tier 2 centers.

        1220
        15:25:13,580 --> 15:25:16,670
        Dale Carder: On those networks there is, you know,

        1221
        15:25:17,460 --> 15:25:28,839
        Dale Carder: for better or worse, IP addresses are used as authorization tokens for what traffic can go on to that network, because that network has an acceptable use policy defining what can and can't be on it.

        1222
        15:25:29,270 --> 15:25:34,900
        Dale Carder: Namely, it's exclusive. It's for the exclusive use of LHC traffic.

        1223
        15:25:35,250 --> 15:25:46,019
        Dale Carder: Now, in the case where you've got a dedicated facility and that's all it does, or maybe you have dedicated DTN machines and all they do is, um, you know, traffic that's LHC-related,

        1224
        15:25:46,070 --> 15:25:54,429
        Dale Carder: it's pretty straightforward. When you start thinking about cloud resources, or even some of the bigger clusters, even, as seen in, like, Open Science Grid,

        1225
        15:25:54,440 --> 15:26:08,900
        Dale Carder: these are multi-science compute nodes. And we've talked to our peers at Brookhaven; this is already happening there, where they have a cluster that can run any job, but this restriction of what traffic can go over LHCONE is

        1226
        15:26:09,470 --> 15:26:13,799
        Dale Carder: a limiting factor, because now the source IP address of the node matters,

        1227
        15:26:13,910 --> 15:26:16,910
        Dale Carder: and trying to adhere to the AUP

        1228
        15:26:16,990 --> 15:26:18,649
        Dale Carder: is a problem.

        1229
        15:26:19,940 --> 15:26:25,579
        Dale Carder: So we figured this out essentially to the degree of very static resources. Right? This works

        1230
        15:26:25,710 --> 15:26:29,680
        Dale Carder: very well for the Tier 1s and Tier 2s, especially in the US.

        1231
        15:26:29,890 --> 15:26:40,300
        Dale Carder: But I do not have a clear understanding of how you would integrate external resources into this. Um,

        1232
        15:26:40,580 --> 15:26:50,020
        Dale Carder: it's an open discussion at this point. It's not like I'm here with an answer. I'm just saying, I think we can all agree that's something to be worked on,

        1233
        15:26:50,830 --> 15:27:07,149
        Dale Carder: um, that has big and public implications, particularly for the transatlantic traffic. So that's why I had this on here: ESnet currently has five 100-gig paths across the Atlantic. We're bringing up two additional 400-gig paths

        1234
        15:27:07,210 --> 15:27:08,380
        Dale Carder: Um

        1235
        15:27:08,730 --> 15:27:12,250
        Dale Carder: sometime next year, hopefully. These are, like, very, very

        1236
        15:27:12,320 --> 15:27:21,510
        Dale Carder: time-intensive builds to get. You know, we're not just buying circuits; we're buying spectrum on undersea cables and integrating it into our network,

        1237
        15:27:21,930 --> 15:27:25,809
        Dale Carder: so and then the contracting side of this is

        1238
        15:27:26,100 --> 15:27:39,229
        Dale Carder: mind-bogglingly complex, and these are multi-year procurements with NDAs in place. So we have additional links that are going to come in after these two by four hundred; we're trying to get on additional cables with additional spectrum.

        1239
        15:27:39,240 --> 15:27:46,920
        Dale Carder: All of this is very easy for us to integrate into LHCONE. It's very easy and straightforward for us to integrate into the DOE ecosystem.

        1240
        15:27:47,680 --> 15:27:53,679
        Dale Carder: Again, how would you use that with, like, third-party cloud sites?

        1241
        15:27:54,780 --> 15:27:58,399
        Dale Carder: Open for exploration. It's not clear.

        1242
        15:27:59,460 --> 15:28:16,819
        Enrico Fermi Institute: So is it fair to say that, you know, it seems like all the physical capability is kind of there when it comes to talking to clouds, but, you know, doing things like getting a block of IPs and announcing those to LHCONE is challenging with the public clouds?

        1243
        15:28:16,880 --> 15:28:17,920
        Dale Carder: Yup,

        1244
        15:28:18,030 --> 15:28:19,470
        Dale Carder: um!

        1245
        15:28:19,580 --> 15:28:27,449
        Dale Carder: Whereas maybe a more straightforward topology is actually maybe something more like HEPCloud, where,

        1246
        15:28:27,590 --> 15:28:29,010
        Dale Carder: you know, from

        1247
        15:28:29,040 --> 15:28:32,810
        Dale Carder: the network's perspective, it's Fermilab on either end.

        1248
        15:28:32,960 --> 15:28:36,210
        Dale Carder: It's Fermilab stuff in the cloud, Fermilab stuff at home,

        1249
        15:28:36,310 --> 15:28:37,490
        Dale Carder: and then it

        1250
        15:28:37,660 --> 15:28:39,219
        Dale Carder: can branch out.

        1251
        15:28:39,710 --> 15:28:59,000
        Enrico Fermi Institute: That may be a more workable model, at least for DOE. So you're saying, like, for transatlantic traffic, Fermilab is kind of the responsible party for making sure that they're agreeing with the AUP and their traffic going across the transatlantic link is LHC traffic, and

        1252
        15:28:59,010 --> 15:29:02,869
        Enrico Fermi Institute: the peering happens between the cloud and Fermilab,

        1253
        15:29:03,190 --> 15:29:11,249
        Dale Carder: or ESnet. But yeah, so the AUP essentially is such that, you know, any DOE resource can do whatever they want,

        1254
        15:29:11,340 --> 15:29:24,840
        Dale Carder: um, including talk to universities. But at present the AUP doesn't straightforwardly allow a Tier 2 to use cloud resources that would be brokered by ESnet as the middleman,

        1255
        15:29:25,600 --> 15:29:31,230
        Dale Carder: to use a cloud resource and expect it to use all this transatlantic capability

        1256
        15:29:31,390 --> 15:29:33,569
        Dale Carder: that DOE has invested in.

        1257
        15:29:36,890 --> 15:29:48,730
        Dirk: Yeah, I wanted to comment on that. I think, I mean, you already said that that's part of the strategy that US CMS is going with, with HEPCloud, that we

        1258
        15:29:48,830 --> 15:29:51,169
        Dirk: we kind of keep it contained.

        1259
        15:29:51,180 --> 15:30:08,940
        Dirk: So, okay, we haven't done anything with large cloud use in a while, nothing like the Amazon test and the Google test five, six years ago. But even then, I think we only targeted regions, the resources, in the US, so that the kind of the data traffic,

        1260
        15:30:09,170 --> 15:30:17,359
        Dirk: the data traffic was contained in the US, mostly between Fermilab and these external resources, and then any kind of

        1261
        15:30:17,370 --> 15:30:31,629
        Dirk: output, the output is transferred over the transatlantic links somewhere else, to a European site. That, then, is an independent step that comes after, and it can go through the LHCONE network, because it originates at Fermilab at that point.

        1262
        15:30:31,860 --> 15:30:48,299
        Dirk: And the same way for the HPC integration: the way we integrate these HPC resources is they're connected to Fermilab. Everything stays together, basically, within the US. And, um,

        1263
        15:30:48,550 --> 15:30:56,320
        Dirk: I don't know, I mean, Fernando, if you have a contract, as if CMS would have a cloud contract and they would want to do a run

        1264
        15:30:56,390 --> 15:31:05,890
        Dirk: where they basically use all the regions in the world together, then that's obviously, then it becomes a problem, because you're talking about overlaying

        1265
        15:31:06,820 --> 15:31:14,690
        Dirk: the global cloud resource mix on top of a somewhat partitioned network infrastructure.

        1266
        15:31:17,700 --> 15:31:20,260
        Dirk: Fernando, what regions are you using right now?

        1267
        15:31:20,630 --> 15:31:26,110
        Dirk: Okay. So it's all Europe. Okay.

        1268
        15:31:28,290 --> 15:31:32,210
        Dale Carder: And just domestic to the US, um,

        1269
        15:31:32,900 --> 15:31:35,930
        Dale Carder: you know, the universities, like the Tier 2 sites,

        1270
        15:31:36,070 --> 15:31:42,160
        Dale Carder: to a large degree have separated their, you know, their LHC traffic from the rest of their institution traffic.

        1271
        15:31:42,440 --> 15:31:43,320
        If

        1272
        15:31:43,420 --> 15:31:48,399
        Dale Carder: those lines were to get blurred, that could have, essentially, impact

        1273
        15:31:48,560 --> 15:31:55,439
        Dale Carder: on the universities. You know, like, you can imagine scientific workloads overwhelming, you know, the cat videos and streaming lectures,

        1274
        15:31:55,630 --> 15:32:03,819
        Dale Carder: right? So it's something to be quite mindful of, how the current sort of ecosystem is built, and, if you wanted to morph it,

        1275
        15:32:04,130 --> 15:32:07,150
        Dale Carder: the communication necessary to do so,

        1276
        15:32:15,340 --> 15:32:18,480
        Dale Carder: So that's what I had. I'm happy to

        1277
        15:32:18,540 --> 15:32:21,630
        Dale Carder: answer more questions, or even just

        1278
        15:32:22,990 --> 15:32:29,649
        Enrico Fermi Institute: I had a small question. Yeah, you mentioned that the connectivity to NSF

        1279
        15:32:29,710 --> 15:32:53,440
        Enrico Fermi Institute: sites is, uh, I guess, spotty, maybe. Yeah, notice how I didn't put that in the slide, but you can read between the lines. Um, so, you know, there's a facility that's being built up, or is built, outside of Boston, some acronym, but it's like a green data center type thing, that all of the Boston area

        1280
        15:32:53,450 --> 15:33:13,260
        Enrico Fermi Institute: uh, and something that both CMS and, I know, ATLAS as well, they have some large storage, some large tape library, that we've each bought some part into. Is this on the end of the better connected?

        1281
        15:33:13,630 --> 15:33:21,159
        Dale Carder: Yeah, it benefits that. You know, it's basically on-network for MIT.

        1282
        15:33:21,440 --> 15:33:35,470
        Dale Carder: So, all right, so they're facilitating a lot of the, they're even going to be facilitating, I think it was in the interim, the connectivity for NET2, which is the ATLAS node there, right?

        1283
        15:33:35,830 --> 15:33:50,070
        Dale Carder: So I don't know if there's anyone from MIT on the call here, but I think the majority of their stuff is at Bates Lab; it's not at MGHPCC. But NET2 does have their new infrastructure, and their existing infrastructure will be at MGHPCC.

        1284
        15:33:50,720 --> 15:33:53,170
        Dale Carder: And right they have some,

        1285
        15:33:53,690 --> 15:33:58,469
        Dale Carder: you know, magic storage back end, to my understanding, they're gonna leverage for that.

        1286
        15:33:58,920 --> 15:34:17,979
        Enrico Fermi Institute: Uh, I think, from one of the talks, they have a very large IBM tape library with GPFS up front.

        1287
        15:34:20,850 --> 15:34:23,289
        Dale Carder: So we've got another question, hand up: David.

        1288
        15:34:24,860 --> 15:34:37,569
        David Southwick: Hi, thanks. Um, maybe this is a naive question, but if you've got, in the current scenario, traffic, let's say, tunneling through Fermilab, and you're wanting to add

        1289
        15:34:38,030 --> 15:34:44,399
        David Southwick: whatever cloud providers, and they're all at two hundred, four hundred gigabit,

        1290
        15:34:45,120 --> 15:34:55,080
        David Southwick: don't you get a bottleneck when you do that?

        1291
        15:34:55,480 --> 15:34:59,770
        Dale Carder: right? So that sort of architecture is fine to a point.

        1292
        15:35:02,510 --> 15:35:05,119
        David Southwick: Okay, thanks. I think I understand.

        1293
        15:35:05,180 --> 15:35:08,980
        Dirk: Maybe to say something: I mean, what we did with

        1294
        15:35:09,510 --> 15:35:17,119
        Dirk: the HEPCloud integration, it's not so much tunneling through Fermilab as that you basically keep the problem set contained to

        1295
        15:35:17,130 --> 15:35:36,849
        Dirk: Fermilab plus cloud. And then later, completely asynchronously from the first one, there's how Fermilab integrates with the rest of the LHC infrastructure. So you kind of tie it together at the storage level. Basically, you move some data to Fermilab, and then, independently of that, once that data actually sits there,

        1296
        15:35:36,860 --> 15:35:49,760
        Dirk: then you can schedule work on that data that can run on cloud sites. And then the network traffic to get that data to the cloud side runs from Fermilab, basically. So they're independent steps. But of course, I mean, eventually,

        1297
        15:35:50,210 --> 15:36:07,170
        Dirk: just because you removed the timing, and it's not an immediate tunnel, still, you still have to keep these resources fed on the cloud and also on the HPC side. So eventually, as the integrated capacity you want to feed in terms of computing

        1298
        15:36:07,180 --> 15:36:15,850
        Dirk: goes up, you kind of have to also work, on the other end, to basically keep the pipeline full of things to work on.

        1299
        15:36:17,670 --> 15:36:25,430
        Enrico Fermi Institute: So with that connectivity, or with the connectivity that's in place today, with that model that Fermilab

        1300
        15:36:25,440 --> 15:36:45,199
        Enrico Fermi Institute: used or is using, I mean, would that be able to take advantage of all that physical connectivity? I mean, the thing I'm kind of struggling with is, how do we go from, you know, ESnet has all this great physical connectivity to clouds, to, you know, how do we take advantage of that in a meaningful way?

        1301
        15:36:45,210 --> 15:37:01,969
        Enrico Fermi Institute: You know what I mean? And I know a lot of that kind of falls under your bucket of things that are hazy and need to be investigated more. Um, you know, is it that, you know, like, if we were to do this for ATLAS, should we, you know, mediate all of the data transfer through the Tier 1 and kind of,

        1302
        15:37:02,140 --> 15:37:06,130
        Enrico Fermi Institute: I guess, orthogonalize the problem, kind of like how Fermilab has it, right, where you have

        1303
        15:37:06,160 --> 15:37:11,739
        Enrico Fermi Institute: connectivity from cloud to national lab as one bit, and then national lab to

        1304
        15:37:11,780 --> 15:37:29,779
        Dale Carder: Right. So you've got that, I mean, that's the class of solutions, right? That's the solution space if you want to work within those confines. If I were, you know, a program officer at DOE or NSF, I would say:

        1305
        15:37:30,210 --> 15:37:34,860
        Dale Carder: Why do you need to do that? What are the other barriers that exist?

        1306
        15:37:34,930 --> 15:37:38,519
        Dale Carder: Tackle those as well, because some of these are social-political.

        1307
        15:37:38,680 --> 15:37:54,999
        Enrico Fermi Institute: Alright. So it's sort of just where you want to go. I mean, of course, our goal is to, you know, have something to say in the report, right? And so what recommendation should we make, right, that people can go on?

        1308
        15:37:55,310 --> 15:38:05,190
        Dale Carder: Right. So on that front, one thing that basically came out of this community, if you want to back way up, was the current

        1309
        15:38:05,300 --> 15:38:23,189
        Dale Carder: grant system at NSF, through what's now the CC* program that facilitates campus and regional upgrades, basically manifested from the ESnet Science DMZ model, and then the NSF community buying in that that is

        1310
        15:38:23,310 --> 15:38:36,369
        Dale Carder: an architectural model that they should provide, you know, financial support for. If you could extend upon that and say, you know, if you can imagine a world where you could seamlessly take advantage of resources no matter where they lie, what would you need?

        1311
        15:38:36,610 --> 15:38:42,729
        Dale Carder: Couldn't that program evolve, or again facilitate, that kind of, you know,

        1312
        15:38:42,760 --> 15:38:43,990
        Dale Carder: connectivity,

        1313
        15:38:45,710 --> 15:38:50,300
        Dale Carder: You know, and in the timescale we're talking about, that's not unreasonable.

        1314
        15:38:54,850 --> 15:38:55,900
        Enrico Fermi Institute: Okay,

        1315
        15:38:57,950 --> 15:39:00,670
        Enrico Fermi Institute: Were there other questions for Dale?

        1316
        15:39:08,300 --> 15:39:13,280
        Enrico Fermi Institute: Okay? Well, thanks a lot, Dale. I think this is a really interesting discussion.

        1317
        15:39:13,310 --> 15:39:20,250
        Dale Carder: Yeah, um, and I'll stick around for the rest of the conference, too, so if more stuff comes up, um,

        1318
        15:39:20,320 --> 15:39:21,780
        Enrico Fermi Institute: yeah, that'd be great.

        1319
        15:39:22,730 --> 15:39:25,880
        Enrico Fermi Institute: All right. I will try to go back to

        1320
        15:39:26,070 --> 15:39:28,459
        Enrico Fermi Institute: sharing the slides over here.

      • 14:00
        Discussion 1h

        [Eastern Time]

         

        More R&D and Discussion

         

        1321
        15:39:28,910 --> 15:39:30,250
        Enrico Fermi Institute: Um.

        1322
        15:39:30,800 --> 15:39:36,420
        Enrico Fermi Institute: So, this kind of leads into the next section. We wanted to talk a little bit about

        1323
        15:39:36,490 --> 15:39:38,910
        Enrico Fermi Institute: R&D efforts.

        1324
        15:39:41,730 --> 15:39:44,150
        Enrico Fermi Institute: Now we've covered some of this already.

        1325
        15:39:46,490 --> 15:39:50,170
        Enrico Fermi Institute: Um, Dirk, Did you want to say a couple of things about this?

        1326
        15:39:50,440 --> 15:40:03,390
        Dirk: Yeah. And this comes directly, if you look at it, from a question that's in the charge, where they basically ask us: is there anything, on the R&D side,

        1327
        15:40:03,670 --> 15:40:05,530
        Dirk: that is needed to

        1328
        15:40:05,900 --> 15:40:09,369
        Dirk: expand

        1329
        15:40:09,590 --> 15:40:23,570
        Dirk: the range of what we can do on commercial cloud and HPC, or increase the cost-effectiveness, which kind of goes hand in hand. And we already talked a little bit about LCF integration in the HPC focus area, that there's

        1330
        15:40:23,640 --> 15:40:27,459
        Dirk: work to be done on the GPU workloads, which is

        1331
        15:40:27,810 --> 15:40:35,630
        Dirk: somewhat out of scope for this workshop, because we're not supposed to talk about framework and software development.

        1332
        15:40:35,680 --> 15:40:52,100
        Dirk: Um, but then there's also integration work. We talked a little bit about this on the cost side: at this point, estimating LCF long-term operations cost is a bit hard, because the integration is not fully worked out.

        1333
        15:40:52,170 --> 15:41:01,009
        Dirk: Um, software delivery: during the HPC focus area, everybody kind of agreed that we want CVMFS everywhere,

        1334
        15:41:01,020 --> 15:41:12,510
        Dirk: and then there's edge services, where also every HPC seems to do their own thing in what they support. They all want to support it, but they kind of have different solutions in place,

        1335
        15:41:12,540 --> 15:41:15,390
        Dirk: and it's also to me at least a bit unclear

        1336
        15:41:15,420 --> 15:41:20,420
        Dirk: what the long-term operational needs are in this area.

        1337
        15:41:20,900 --> 15:41:28,610
        Dirk: And then we already talked a little bit about dynamic cloud use, which means basically you do, like, your whole,

        1338
        15:41:28,750 --> 15:41:44,449
        Dirk: the whole processing chain inside the cloud. Fernando talked about that a little bit, because to reduce egress charges you basically copy in your input data once and then do multiple processing runs on it, and

        1339
        15:41:44,460 --> 15:41:56,950
        Dirk: only keep the end result, basically, and forget about the intermediate output. And then you save, because you don't have to get it all out; you only have to get the smaller final output. We already talked about machine learning.
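
        A quick illustrative calculation of why keeping the processing chain inside the cloud saves money; the sizes, run count, and per-GB price are assumptions.

        ```python
        # Illustrative numbers only.
        EGRESS_PER_GB = 0.09                     # assumed egress price, USD/GB
        intermediate_tb, final_tb, n_runs = 80, 5, 3

        # Export every intermediate product vs. export only the final output.
        naive = (intermediate_tb * n_runs + final_tb) * 1000 * EGRESS_PER_GB
        contained = final_tb * 1000 * EGRESS_PER_GB
        print(f"export everything: ${naive:,.0f}")      # ~$22,050
        print(f"final output only: ${contained:,.0f}")  # ~$450
        ```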

        1340
        15:41:58,040 --> 15:41:59,560
        Dirk: And then uh,

        1341
        15:42:01,030 --> 15:42:20,909
        Dirk: there's R&D work on different architectures, to be able to support these, which opens up possibilities in both HPC and cloud use: FPGAs, various GPU types. That feeds into the GPU workloads, but it's not exclusive to just

        1342
        15:42:21,080 --> 15:42:32,130
        Dirk: GPU workloads, because it could also be machine learning, like, how do we integrate machine learning to make use of these new architectures? And that's going to be

        1343
        15:42:32,750 --> 15:42:35,820
        Dirk: integration R&D, but also basic,

        1344
        15:42:35,910 --> 15:42:41,970
        Dirk: basic R&D on some of these topics. And then there were some

        1345
        15:42:42,710 --> 15:42:50,129
        Dirk: things that we're kind of playing around with that are unique to the cloud, where they're offering platforms that are

        1346
        15:42:50,460 --> 15:43:07,240
        Dirk: kind of hard to replicate in-house, like the BigQuery, BigTable experiments, functions as a service. I don't know too much about it; we just threw it on here. Maybe Lindsay or Mike could say something about that, or someone else that's more familiar with it.

        1347
        15:43:10,780 --> 15:43:17,670
        Paolo Calafiura (he): I won't say that I'm familiar with functions as a service, but I just want to mention that this is also

        1348
        15:43:17,690 --> 15:43:30,329
        Paolo Calafiura (he): um, an area important for HPCs, and they are developing, they are developing functions as a service, probably with the same framework, the funcX framework. Yes,

        1349
        15:43:30,340 --> 15:43:48,699
        Paolo Calafiura (he): and there is apparently a solution for getting onto the main LCFs with funcX, using something called... So this is something we are very interested in; it could be a possible joint project across the

        1350
        15:43:48,710 --> 15:44:05,420
        Enrico Fermi Institute: So, I guess from personal experience: we actually quite routinely use Parsl for farming out analysis jobs. And at some point back in the day there was a proof of concept

        1351
        15:44:05,430 --> 15:44:11,389
        Enrico Fermi Institute: using a funcX endpoint and doing analysis jobs with that.

        1352
        15:44:11,420 --> 15:44:36,939
        Enrico Fermi Institute: Um, so all of the groundwork for that has actually been laid out, and we could return to using that. We just ended up using Dask a little bit more prevalently. But it's also something that's up to the user, or that we left up to the user at the end of the day, and if we want to develop more infrastructure around that, we have a basis to start from.
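
        For reference, a minimal Parsl sketch of the farming-out pattern described here; the executor settings and the task body are placeholders, and a funcX endpoint could play the same role for remote execution.

        ```python
        import parsl
        from parsl import python_app
        from parsl.config import Config
        from parsl.executors import HighThroughputExecutor

        # Local pilot-style workers; settings are illustrative.
        parsl.load(Config(executors=[HighThroughputExecutor(max_workers=4)]))

        @python_app
        def analyze(lo, hi):
            # Placeholder per-chunk analysis task.
            return sum(x * x for x in range(lo, hi))

        # Fan out the chunks; results come back as futures.
        futures = [analyze(i, i + 1000) for i in range(0, 10_000, 1000)]
        print(sum(f.result() for f in futures))
        ```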

        1353
        15:44:36,950 --> 15:44:53,969
        Enrico Fermi Institute: As far as, like, going to production workflows or reconstruction or something like that, I don't think that's been explored at all. Um, but it looked really promising and interesting from the analysis view of things.

        1354
        15:44:53,980 --> 15:45:07,179
        Enrico Fermi Institute: And I think at the time it was just a little bit immature compared to where things have gone more recently. For BigQuery and BigTable, I think this is actually,

        1355
        15:45:07,960 --> 15:45:21,790
        Enrico Fermi Institute: right, this was studied by Gordon Watts and company, and they did a couple of benchmarkings of what the performance per dollar was for analysis-like queries on

        1356
        15:45:21,830 --> 15:45:26,670
        Enrico Fermi Institute: data sets backed by various engines,

        1357
        15:45:27,330 --> 15:45:44,309
        Enrico Fermi Institute: and we could go and take a look at that paper, but the gist of it was that BigQuery and BigTable are not nearly as cost-efficient as using RDataFrame, for instance, or Coffea, or, well, Awkward Array plus Uproot, for instance.

        1358
        15:45:44,320 --> 15:46:01,499
        Enrico Fermi Institute: So there are already some demonstrations that, while these offerings are there, they're not quite up to the performance that we can already provide with our homegrown tools. But maybe this also provides a way to talk with the bigger cloud services and say, hey,

        1359
        15:46:01,510 --> 15:46:06,510
        Enrico Fermi Institute: this is the kind of performance we need. Can we do any impedance matching here?

        1360
        15:46:08,310 --> 15:46:15,509
        Dirk: What, sorry, that was a bit of an information dump. No, it's fine. But the thing is, this is all,

        1361
        15:46:16,260 --> 15:46:27,480
        Dirk: the question, one basic question I had about this is: while some of these areas that are being worked on can provide quite a great

        1362
        15:46:27,500 --> 15:46:34,020
        Dirk: improvement in user experience, like, at the analysis level, you just,

        1363
        15:46:34,090 --> 15:46:40,670
        Dirk: yeah, to what extent are they applicable, if you look at, like, a global picture of

        1364
        15:46:40,730 --> 15:46:48,980
        Dirk: experiment resource use? I mean, because the individual user experience doesn't necessarily mean you save a lot of

        1365
        15:46:48,990 --> 15:47:02,780
        Dirk: resources overall, but you can make life easier for your users, and you improve the physics output, and that's all great. It's just, um, in terms of looking at that application of,

        1366
        15:47:02,950 --> 15:47:09,230
        Dirk: of money in terms, is this a large enough area that we have to,

        1367
        15:47:10,220 --> 15:47:23,680
        Enrico Fermi Institute: how prominently should we put it into the report? Basically, that's what I'm trying to get at.

        1368
        15:47:23,690 --> 15:47:37,939
        Enrico Fermi Institute: As you make things more scalable, so that folks can, like, you know, do the first exploratory bits of their analysis from their laptop, and then scale that seamlessly into the cloud with funcX or whatever,

        1369
        15:47:38,120 --> 15:47:55,869
        Enrico Fermi Institute: um, if you can make it so that those first exploratory steps are at less scale, then of course that means that the resource usage, as you scale up more and more, is going to be much more uniform between all the users that you

        1370
        15:47:55,880 --> 15:48:08,630
        Enrico Fermi Institute: have engaging with the system, which means you can probably schedule it all a little bit better, which I think is another way of saying, you know, you just make things nicer for the users. Um,

        1371
        15:48:08,640 --> 15:48:28,579
        Enrico Fermi Institute: but, one, it means that figuring out how to schedule all that becomes easier, which means it becomes more efficient from your perspective, or from the operational perspective, I would say. And then it also changes the way in which people

        1372
        15:48:28,590 --> 15:48:51,250
        Enrico Fermi Institute: compete for resources at clusters, because all the analyses start looking more and more the same. And they also start reaching the larger resources at a higher level of maturity than perhaps what you see even nowadays; sometimes people just run stuff and see what happens, and it's very, very experimental software, let's say.

        1373
        15:48:51,260 --> 15:48:54,349
        Enrico Fermi Institute: um. So I I

        1374
        15:48:54,520 --> 15:49:00,139
        Enrico Fermi Institute: to answer your question of like, is this big enough to care?

        1375
        15:49:00,760 --> 15:49:15,249
        Enrico Fermi Institute: I have a feeling that right now it is big enough to care about, and the fact that we're getting more data is going to keep it in the regime of being big enough to care about and report, and make sure that we actually treat this,

        1376
        15:49:15,260 --> 15:49:40,909
        Enrico Fermi Institute: at least in a special way, because the resource usage pattern is wildly different from production. But as we roll out things like functions as a service, or figure out how to scale columnar analysis and RDataFrame effectively, it's going to shrink the competition, or, yeah, it's going to make the usage of resources less spiky and easier to manage, which is kind of good for us.

        1377
        15:49:40,920 --> 15:49:53,019
        Enrico Fermi Institute: But also it's not going to make it a bigger piece of the competition for all the computing resources. So that's what it sort of looks like in my mind, kind of extrapolating from what we have right now.

        1378
        15:49:53,070 --> 15:50:12,099
        Enrico Fermi Institute: I think the answer then is, we need to watch it and see what these systems that are just starting to come online actually do for resource usage, even if it's not at scale, and see if it does bring kind of this evening out of competition for resources at Tier 2s,

        1379
        15:50:12,110 --> 15:50:15,289
        Enrico Fermi Institute: um and otherwise making the analysis,

        1380
        15:50:15,620 --> 15:50:21,180
        Enrico Fermi Institute: analysis computing usage a bit more even, as far as,

        1381
        15:50:21,370 --> 15:50:25,670
        Enrico Fermi Institute: sorry, even as far as job submission goes, and things like that.

        1382
        15:50:25,860 --> 15:50:29,870
        Enrico Fermi Institute: That's sort of my view. I, of course,

        1383
        15:50:30,000 --> 15:50:38,340
        Enrico Fermi Institute: Yeah, this is trying to predict the future. So other people please feel free to predict the future, too, and we can see what works

        1384
        15:50:39,280 --> 15:50:57,220
        Paolo Calafiura (he): Always very informative to hear from you. Uh, I'm certainly not nearly as competent, and I know there are more competent people on the call who may want to chime in. But our interest from the CCE side is,

        1385
        15:51:05,270 --> 15:51:24,750
        Paolo Calafiura (he): complex enough that, the paradox... And by the way, Dirk, yesterday we heard that CMS is sort of fighting against the provisioning challenges, you know, creating workers with the right,

        1386
        15:51:24,760 --> 15:51:28,160
        Paolo Calafiura (he): uh, you know, with the right capabilities.

        1387
        15:51:28,170 --> 15:51:50,549
        Paolo Calafiura (he): Uh, you know, to some extent, I don't know which, since I'm incompetent there, these issues have been addressed by the folks who developed Parsl. So some of those issues have made ATLAS think that it could be a good back end for some of our existing code, in this sort of,

        1388
        15:51:50,560 --> 15:51:56,159
        Paolo Calafiura (he): and I'm hoping that somebody more competent jumps in.

        1389
        15:51:57,290 --> 15:52:13,480
        Enrico Fermi Institute: Um, the only thing that I can tack on to that is that Anna and company back in the day figured out how to make a backfilling system using funcX and Parsl. So that's definitely something that works.

        1390
        15:52:13,530 --> 15:52:29,769
        Enrico Fermi Institute: And that's also what the guys at Nebraska are doing with the Coffea-Casa analysis facility, as they're backfilling into the production jobs. So for sure, this is a pattern that works and that people can implement. But,

        1391
        15:52:29,780 --> 15:52:34,630
        Enrico Fermi Institute: uh, we also, we don't know how it scales out,

        1392
        15:52:34,750 --> 15:52:43,950
        Enrico Fermi Institute: you know, to more and more data and more and more users. The usage right now, I would say, is fairly limited. And yeah, that's,

        1393
        15:52:45,020 --> 15:52:50,759
        Enrico Fermi Institute: I think that helps add some context. But we definitely need to hear from more people on this.

        1394
        15:52:51,470 --> 15:52:59,310
        Dirk: Hey, maybe just one comment: we're primarily interested in production here. But, on the other hand, analysis takes over

        1395
        15:52:59,610 --> 15:53:06,270
        Dirk: half our resources, or half the Tier 2s at least, so there's a significant fraction. So if analysis gets easier,

        1396
        15:53:06,690 --> 15:53:13,279
        Dirk: that means maybe there's more resources for production to use. Just as a quick correction: it's only a quarter, Dirk.

        1397
        15:53:13,390 --> 15:53:18,340
        Dirk: Oh, it's a quarter of it? I thought it was half the Tier 2s. Now it's a quarter,

        1398
        15:53:18,530 --> 15:53:20,280
        Dirk: it's a quarter now. Okay.

        1399
        15:53:20,350 --> 15:53:28,460
        Enrico Fermi Institute: Yeah, as more production just shows up, the fraction gets smaller and smaller.

        1400
        15:53:33,200 --> 15:53:46,199
        Enrico Fermi Institute: But yeah, I mean, just thinking about it more, there's also this rather severe impedance mismatch, at least right now, with kind of the cadence of analysis jobs versus production jobs,

        1401
        15:53:46,210 --> 15:53:55,879
        Enrico Fermi Institute: since it's much more bursty and short-lived, as opposed to a production job that comes in and, you know, is going to use twenty-four hours in a slot, or something like that.

        1402
        15:53:56,180 --> 15:54:02,060
        Enrico Fermi Institute: So by its very nature it's a much more adaptive

        1403
        15:54:02,510 --> 15:54:06,890
        Enrico Fermi Institute: and reactive scheduling problem.

        1404
        15:54:20,280 --> 15:54:28,630
        Enrico Fermi Institute: So one of the things that we mentioned with the cloud offerings, I mean, we had a couple of examples there: BigQuery, BigTable, functions as a service.

        1405
        15:54:28,650 --> 15:54:47,950
        Enrico Fermi Institute: One of the questions I had at least, was it. Is there anything i'm missing right like on the cloud? Right? Because if you go and look at the service catalog for something like aws. It has this humongous, you know, spread of, of of things that they can services that they offer. Uh, is there anything that we're

        1406
        15:54:47,990 --> 15:54:49,940
        Enrico Fermi Institute: leaving on the table that we should

        1407
        15:54:50,600 --> 15:54:51,950
        Enrico Fermi Institute: you should look into?

        1408
        15:54:55,200 --> 15:54:59,800
        Enrico Fermi Institute: Uh, I'll say that something that's interesting.

[Enrico Fermi Institute] 15:55:00
maybe not just for clouds but also for sort of on-premises facilities: things like SONIC, which lets us disaggregate the GPUs and the CPUs. So if you're doing inference, you might not need a whole GPU. But,

[Enrico Fermi Institute] 15:55:18
you know, in the cloud case, let's just stick with that: you might be buying a bunch of GPU nodes,

[Enrico Fermi Institute] 15:55:27
which are many times more expensive. But, you know, if the reconstruction path only needs a quarter of a GPU, being able to independently scale up the number of GPUs and CPUs that you're running at a time

[Enrico Fermi Institute] 15:55:39
is something useful. And, like I mentioned, that goes for the on-premises stuff too, because you can stick either two or four of these GPUs into a box; but if the core count is two hundred and fifty-six on the node, then

[Enrico Fermi Institute] 15:55:52
you'd better hope that the

[Enrico Fermi Institute] 15:55:55
fraction of time that you're spending on the GPU, and the speedup that you get (Amdahl's law and all that), actually makes it worthwhile.
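
(For reference, the Amdahl's-law point above can be made concrete: if a fraction $p$ of a job's wall time can run on the GPU with speedup $s$, the best overall speedup is

\[
S = \frac{1}{(1 - p) + p/s}
\]

so, for example, $p = 0.25$ and $s = 10$ gives $S = 1/(0.75 + 0.025) \approx 1.29$; a fat CPU node gains little from its GPUs unless $p$ is large.)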

[Enrico Fermi Institute] 15:56:19
Yes, and going on to that, there already is, and there will be, an ever-growing class of analysis users asking for GPUs too, and you have to again deal with this very different rate of scheduling resources for them.

[Enrico Fermi Institute] 15:56:38
Um, and sometimes the burstiness of the data processing that they're trying to do on that GPU is much, much higher compared to a production job, even if the total resources are much higher on the production side, just because of job multiplicity.

[Enrico Fermi Institute] 15:56:55
You have users that are, you know, just poking around doing their exploratory stuff, and right now we give them a whole T4. Well, a T4 per hour is not cheap, not cheap at all. And you'll have people training models and then loading them onto a T4, running their whole signal data set, or something like that, to see what it looks like in the tails, et cetera, or running it on their backgrounds.

[Enrico Fermi Institute] 15:57:19
And it's still the same problem of needing to

[Enrico Fermi Institute] 15:57:24
schedule your GPUs very piecemeal, and then on top of that schedule all the networking between them, because you have this really insane burst of inference requests for a very short amount of time that you need to negotiate on your network, to not mess with everyone else's jobs.
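
(As a sketch of the inference-as-a-service pattern being described, where SONIC-style clients send requests over the network to a shared NVIDIA Triton GPU server, a minimal client call might look like this. The server address, model name, and tensor names are placeholders, not anything from an actual deployment.)

```python
# Sketch of a SONIC-style remote inference request to a shared Triton GPU
# server. The URL, model name, and tensor names below are placeholders.
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="triton.example.org:8001")

batch = np.random.rand(64, 128).astype(np.float32)   # one bursty request
inp = grpcclient.InferInput("INPUT0", list(batch.shape), "FP32")
inp.set_data_from_numpy(batch)
out = grpcclient.InferRequestedOutput("OUTPUT0")

result = client.infer(model_name="toy_model", inputs=[inp], outputs=[out])
print(result.as_numpy("OUTPUT0").shape)
```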

[Enrico Fermi Institute] 15:57:43
So

[Enrico Fermi Institute] 15:57:44
it might not be huge. You said it's a quarter of the Tier-2s right now; let's say it just stays a quarter of that. But

[Enrico Fermi Institute] 15:57:54
the way that it's going to be using the resources, if it's that bursty, may not look like a quarter at certain points in time during the analysis workflow, and that's something we have to be ready to deal with.

[Enrico Fermi Institute] 15:58:09
I have no idea how to actually schedule that.

[Enrico Fermi Institute] 15:58:19
So we're almost at the top of the hour.

[Enrico Fermi Institute] 15:58:23
Any other topics that we wanted to hit before we wrap up for the day?

[Enrico Fermi Institute] 15:58:41
So, logistically, I think tomorrow we were going to talk a little bit about...

[Enrico Fermi Institute] 15:58:49
in the morning, I think we were going to talk about accounting and pledging.

[Enrico Fermi Institute] 15:58:55
We're going to talk about some, you know,

[Enrico Fermi Institute] 15:58:57
facility features and policies, have a discussion about security topics when it comes to HPC and cloud, and, yeah, allocations, planning, that sort of thing. Then, in the afternoon,

[Enrico Fermi Institute] 15:59:14
we'll have a presentation from the

[Enrico Fermi Institute] 15:59:18
Vera Rubin folks to talk about their experiences.

[Enrico Fermi Institute] 15:59:23
And then, yeah, some summary-type work, and just, you know, other topics or observations that people would like to bring up. So if there's something that we haven't hit on the agenda that people would really like to talk about, tomorrow afternoon would be a really good time to bring that up.

[Enrico Fermi Institute] 15:59:47
Anything else from anyone?

[Enrico Fermi Institute] 15:59:55
Okay, sounds like not. All right, thanks, everybody. We'll talk to you tomorrow.

[Fernando Harald Barreiro Megino] 16:00:01
Bye. Thank you.

    • 10:00 12:00
      Third Day Morning

(Eastern Time)

       

      [Enrico Fermi Institute] 11:00:03
Cool, Dale is indeed here.

      [Enrico Fermi Institute] 11:00:25
      We'll wait a couple more minutes for folks to filter in

      [Enrico Fermi Institute] 11:01:41
      oh!

      [Enrico Fermi Institute] 11:02:37
      Yes, slides

      [Enrico Fermi Institute] 11:03:24
Okay, we're gonna get started here in just a minute.

      [Enrico Fermi Institute] 11:03:28
So thanks everyone for making it to the final day of the workshop. The goal

[Enrico Fermi Institute] 11:03:35
for this morning will be to have a discussion about a number of topics: one is accounting, and the other would be to continue our pledging discussion,

[Enrico Fermi Institute] 11:03:50
assuming, you know, all the appropriate people are here for that.

[Enrico Fermi Institute] 11:03:53
Another topic we wanted to cover was security, you know, both on the clouds and the HPCs.

[Enrico Fermi Institute] 11:04:02
And I think ultimately the stuff that we cover this morning will inform the policy recommendations that we would make as part of the report in the afternoon.

[Enrico Fermi Institute] 11:04:17
I don't expect us to take the full 2 h; we may not even take the full 2 h this morning.

[Enrico Fermi Institute] 11:04:26
But in the afternoon we'll have a presentation from the Vera Rubin folks, and then I think we have a couple of other minor topics, any other business, and sort of next steps for the work.

[Enrico Fermi Institute] 11:04:47
I see about 20 people online, so that's probably good to get going here.

      [Enrico Fermi Institute] 11:04:56
      So yeah.

      [Enrico Fermi Institute] 11:05:01
      Dirk, did you want to make comments about this one?

      [Dirk Hufnagel] 11:05:03
Yeah, I can get this started. So, as with previous slides, the green is a question that's directly copied from the charge, and the question was: what can USATLAS and USCMS recommend to the collaborations to improve the utilization of commercial cloud and HPC

[Dirk Hufnagel] 11:05:23
resources, both in terms of enabling more workflows we can run there, and also improving the cost effectiveness of using these resources. And one thing that already came up in the previous days, but we'll have a dedicated slot

[Dirk Hufnagel] 11:05:44
for here, so we'll see how much additional discussion we get, is that at the moment these resources are nice to have, say, opportunistically: they give us some free cycles where we can run some things, but they're not included

[Dirk Hufnagel] 11:05:58
in the pledge, so we don't get full credit for them, and we don't have them included in the planning.

[Dirk Hufnagel] 11:06:06
Another thing is that to make use of sites like the LCFs

      [Enrico Fermi Institute] 11:06:08
      Let's

      [Dirk Hufnagel] 11:06:12
specifically, but also other opportunities that are available in the cloud,

[Enrico Fermi Institute] 11:06:15
Okay.

[Dirk Hufnagel] 11:06:16
and then our own grid sites potentially, we need GPU payloads.

[Dirk Hufnagel] 11:06:20
We need some way to utilize deployed GPU resources. On the cloud side, we had a lot of discussion yesterday on that in the cost section and on the networking: the big worry, a big part of the worry on cost on cloud, is egress.

      [Enrico Fermi Institute] 11:06:24
      See.

      [Enrico Fermi Institute] 11:06:27
      Okay.

      [Dirk Hufnagel] 11:06:41
So, minimizing egress, or negotiating some way, be it peering agreements

[Dirk Hufnagel] 11:06:49
or a subscription where this cost is basically removed, or targeting cloud resources that don't have egress charges:

[Dirk Hufnagel] 11:06:57
we basically want to avoid the egress costs. Currently they're not dominating,

[Dirk Hufnagel] 11:07:03
but they are a large factor of the whole cloud cost calculation. And on the HPC side the focus will be on reducing operational overheads, especially on the LCF

[Dirk Hufnagel] 11:07:19
side, where this is still a little bit of an R&D area in terms of figuring out what the final operational model will look like. And then another way to reduce cost and make it easier to operate HPC resources is to get sizable storage allocations, because it makes

[Dirk Hufnagel] 11:07:39
things simpler.

[Dirk Hufnagel] 11:07:47
We could move on, maybe, at this point, to talking about accounting, which is part of having these types of resources be a fully equivalent player. I mean, pledging is one side, but then you need to take care of accounting to make sure that you actually deliver what

[Dirk Hufnagel] 11:08:06
you promise to deliver. We have a comment; I guess we can

[Dirk Hufnagel] 11:08:10
do that right away.

      [Enrico Fermi Institute] 11:08:14
      Yeah, feel free to jump in at any time.

      [Dirk Hufnagel] 11:08:16
      Eric

      [Eric Lancon] 11:08:17
Yes, good morning. Before we change the slide: I think the first recommendation would be that ATLAS and CMS both work on

[Eric Lancon] 11:08:35
having GPU-, or whatever, accelerator-friendly payloads, because if you don't have such generic software, you will not be able to use these resources.

[Eric Lancon] 11:08:56
And if these resources are very restricted to a specific kind of application, let's say just a specific

[Eric Lancon] 11:09:07
workflow because it's suited to, hey, GPUs, there would be no way to get them pledged, because pledges, as they are for now,

[Enrico Fermi Institute] 11:09:13
Okay.

[Eric Lancon] 11:09:19
are for all workflows together; they're not graded by type of workflow.

      [Dirk Hufnagel] 11:09:28
Yes, that's one problem, and we can discuss this later, that at the moment the pledges, I mean, you can...

[Enrico Fermi Institute] 11:09:33
Sure.

[Dirk Hufnagel] 11:09:34
You can pledge across CPU architectures, because you can

[Dirk Hufnagel] 11:09:39
run your normalization, whatever it is; right now HS06, in the future it will be HEPscore.

[Dirk Hufnagel] 11:09:45
You can run your benchmark, you get your CPU speed.

[Dirk Hufnagel] 11:09:49
It includes some averaging across architectures, but at least it's in principle possible.

[Dirk Hufnagel] 11:09:54
GPU is more tricky, because how do you account for that?

[Dirk Hufnagel] 11:09:56
Do you give like a 20% bonus? Do you look at the most commonly used workflow?

[Dirk Hufnagel] 11:10:02
And I think that's something we can't decide here.

[Dirk Hufnagel] 11:10:06
That's something we'll have to discuss with WLCG: once we have these workloads that can use GPUs and get a benefit from them,

[Dirk Hufnagel] 11:10:16
how do you factor that into the performance normalization?

[Dirk Hufnagel] 11:10:21
Are there going to be some extra factors? Is it a different category that you pledge?

[Dirk Hufnagel] 11:10:25
I don't know, I don't have the answers, but it's an area that needs to be discussed

[Dirk Hufnagel] 11:10:30
with WLCG.
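
(To make the normalization question concrete, here is a sketch of the pledge arithmetic as it works for CPUs today, with the GPU factor left as the explicitly open question; all numbers and function names are illustrative, not real benchmarks.)

```python
# Illustrative pledge arithmetic. The HS06-per-core value and the GPU
# factor are invented numbers; how to normalize GPUs is the open WLCG
# question discussed above.
def cpu_capacity_hs06(cores, hs06_per_core):
    # Today: benchmark the CPU, multiply out over the cores.
    return cores * hs06_per_core

def gpu_capacity_hs06(n_gpus, gpu_factor):
    # Undecided today: a flat per-GPU bonus? A workload-weighted score?
    return n_gpus * gpu_factor

total = cpu_capacity_hs06(10_000, 15.0) + gpu_capacity_hs06(40, 500.0)
print(f"pledged capacity: {total:.0f} HS06-equivalent")
```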

      [Eric Lancon] 11:10:33
Yes, but the recommendation to the experiments would be first to actually drive

[Enrico Fermi Institute] 11:10:39
The

[Eric Lancon] 11:10:41
the development of GPU-friendly software.

      [Dirk Hufnagel] 11:10:45
Yes, I mean, we're doing that. There are two avenues; CMS at least has two avenues, and that's the machine

[Dirk Hufnagel] 11:10:53
learning that we talked about yesterday, which might be the most common ground at the moment between CMS and ATLAS, because of similar training frameworks and so on

[Dirk Hufnagel] 11:11:04
that you could use, and then in the framework itself. For obvious reasons

[Dirk Hufnagel] 11:11:08
these are somewhat distinct from each other, but we're both looking at that.

[Dirk Hufnagel] 11:11:12
CMS maybe is a little bit more advanced than ATLAS because of the push from the HLT and the deployment of GPU resources there,

[Dirk Hufnagel] 11:11:21
but I'm sure that ATLAS is also at least looking at this.

      [Enrico Fermi Institute] 11:11:31
      Okay.

      [Dirk Hufnagel] 11:11:36
Lincoln, can you go to the next slide,

[Dirk Hufnagel] 11:11:39
if that comment was addressed? So, looking at accounting: maybe this is a good time.

[Dirk Hufnagel] 11:11:46
We have an invited contribution on benchmarking, because for accounting and for pledging one of the prerequisites is that you have to know what you're actually pledging, not just in terms of cores, but in terms of some normalized numbers, HS06 or in the

[Dirk Hufnagel] 11:12:05
future HEPscore. So Maria offered a contributed talk about benchmarking of HPC, I think,

[Enrico Fermi Institute] 11:12:06
Good.

[Dirk Hufnagel] 11:12:16
and David wanted to give the talk, so he's connected.

      [David Southwick] 11:12:20
Yeah, I am. Can you hear me?

[Dirk Hufnagel] 11:12:21
Okay, yeah. Do you want to share the slides? Otherwise Lincoln can also share.

[Enrico Fermi Institute] 11:12:22
Yeah, well, I'm not sharing anything.

      [David Southwick] 11:12:25
      Okay. Just like my headphones have

      [David Southwick] 11:12:35
      Just a second here

      [David Southwick] 11:12:38
      Okay, you can still hear me. Great. So I do have just a few short slides.

      [Enrico Fermi Institute] 11:12:41
      We can.

      [Enrico Fermi Institute] 11:12:45
      Okay, okay.

      [David Southwick] 11:12:45
that I'd be happy to share.

      [David Southwick] 11:12:49
      See if I can do this

      [Enrico Fermi Institute] 11:12:51
      Sure.

      [David Southwick] 11:12:59
      Yes, hmm.

      [David Southwick] 11:13:06
      Okay.

      [David Southwick] 11:13:13
Okay, do you see the slides? Okay, great. So, yeah, I've got a few comments, and I just want to share a bit about what we're doing

      [Enrico Fermi Institute] 11:13:15
We do. Yep, looks good.

      [David Southwick] 11:13:27
with HPC and the HEP benchmarking. The last couple of years I've been collaborating with the HEPiX benchmarking working group, really to take this replacement model for HEPSpec06

[David Southwick] 11:13:49
and develop it so that it can work on HPC

[David Southwick] 11:13:52
as well. I'm sure many people are very familiar with this and how it was done in the past:

[David Southwick] 11:13:57
it was meant to be as similar as possible to the bare-metal worker

[David Southwick] 11:14:04
nodes that WLCG was using. So it was VMs, or at some point nested containers,

[David Southwick] 11:14:11
and things like this that are in no way compatible with HPC. So we put in a bunch of work to make this as lightweight and user-friendly as possible, so it's totally rootless;

[Enrico Fermi Institute] 11:14:23
Okay.

[David Southwick] 11:14:27
now we've switched to Singularity images, and there are also a bunch of quality-of-life things that allow you to use it on sites that, you know, don't have wide-area networking, things like that. So it's really been a big effort for

[David Southwick] 11:14:46
the last year or so. A bit about the suite itself:

      [David Southwick] 11:14:51
I'm sure some of you know it,

[David Southwick] 11:14:52
maybe have run it already, since we've been distributing, say, proof-of-concept or release candidates here for the last couple of months.

[David Southwick] 11:15:04
Basically, a lot of this I already shared, but it's now a sort of flexible thing that you can run on any hardware.

[David Southwick] 11:15:14
You can see I've got a small graphic on the right here: the suite itself is an orchestrator that will go and collect metadata on whatever hardware it's running on, and control the array of benchmarks you want to use. So the bottom

[David Southwick] 11:15:30
part of the graphic: there are a couple of different benchmarks. HEPSpec06 is one of them; HEPscore, which is the candidate replacement for it, or I don't know if I can call it candidate anymore;

[David Southwick] 11:15:46
but you can easily plug in other benchmarks as well.

      [David Southwick] 11:15:49
So this is the tool we've been using on HPC. A bit about that: this effort, like I said, started,

[David Southwick] 11:15:59
I guess, more than a year ago. The initial presentation of the HPC

[David Southwick] 11:16:07
work was during CHEP '21, and at that time we had just done large-scale deployments, doing several-hundred-thousand-core campaigns, and we were looking at comparing the new AMD

[David Southwick] 11:16:22
CPUs that were available widely on HPC

[David Southwick] 11:16:26
sites, but not yet widely accessible elsewhere. So we did a comparison of that, and stability studies and whatnot.

[David Southwick] 11:16:37
So that was the interesting first step. What's happened since then is we've had a lot of software become available from the experiments:

[David Southwick] 11:16:48
obviously the first look at the Run 3 workloads, but along with that there's been a bunch of heterogeneous development from basically all of the experiments.

[David Southwick] 11:16:57
And I mean heterogeneous both in compiled codes and in accelerators.

[David Southwick] 11:17:06
So we've got several workloads that are in development for Power and, of course, GPUs. So we've been using HPC then to take these workloads, or let's say snapshots

[David Southwick] 11:17:23
of them when they're sufficiently stable, containerize them in Singularity, and then run them at scale on HPC.

[David Southwick] 11:17:32
And that enables a lot of different interesting studies: GPU versus CPU, and then combined for some workloads that support that, as well as more exotic combinations,

[David Southwick] 11:17:46
so ARM plus GPU, Power plus GPU, and things like this.

      [David Southwick] 11:17:54
And I know this was discussed at some point yesterday;

[David Southwick] 11:17:56
I think I was listening in. But in any case, these are now available via the benchmarking suite, and you can run them just as you would on a bare-metal machine on HPC.

[David Southwick] 11:18:12
I think there was a presentation yesterday from Eric about the MLPF AI workload.

[David Southwick] 11:18:20
That's also containerized and can be run at scale on HPC;

[David Southwick] 11:18:25
however, the configuration they have at the moment, the snapshot of it, is just single-node

[David Southwick] 11:18:31
at the moment. So if you do want to use that for MPI-related scaling, then you need to run just the workload container and not the suite, because, well, it's not the target at the moment of WLCG to do MPI. And as I mentioned, I guess we've

[David Southwick] 11:18:54
got a lot of other quality-of-life things. So if you have local storage, or CVMFS, you can take advantage of this instead of remote pull copies and whatnot.

[David Southwick] 11:19:03
But really, as has been said here many times, there's a lot of interest in GPUs, and you can see I've got a small slide here on the numbers

[David Southwick] 11:19:15
you get from current and next-generation GPUs.

[David Southwick] 11:19:21
So there's a lot going in that direction, and we know what we're expecting.

      [David Southwick] 11:19:30
That being said, there isn't an industry-standard GPU benchmark yet, and, at least from the benchmarking side of things,

[David Southwick] 11:19:43
we are kind of approaching it in the same way that we do CPUs:

[David Southwick] 11:19:48
you use a workload set, production workloads or what will be production workloads, and we can generate the score in the same way that we do for HEPscore, which is some function of throughput, or events

[David Southwick] 11:19:59
per second. So this is what we've been using so far when trying to understand the capabilities of a machine that is going to be running GPU

[David Southwick] 11:20:13
plus CPU, or GPU only. And there was a HEPscore workshop last week, I think, that several people here were probably part of, and a lot of discussion happened there as well on how to account for these sorts of resources.

[David Southwick] 11:20:34
So, to conclude: we've been active on HPC

[David Southwick] 11:20:38
benchmarking now for a couple of years. We use the suite because of its automated running and reporting at large scale:

[Enrico Fermi Institute] 11:20:39
okay.

[David Southwick] 11:20:48
you can do whole-partition or several-partition benchmarks.

[David Southwick] 11:20:52
This includes exotic workloads, both for machine learning

[David Southwick] 11:20:56
and AI as well as architectures, as well as starting to look at,

[David Southwick] 11:21:03
let's see, sort of the other services that you get on HPC. I know it was mentioned yesterday that there are issues with scaling

[David Southwick] 11:21:14
IO-bound workloads, and how can you tell what's good on a shared file system that maybe you don't have any information about?

[David Southwick] 11:21:24
So we are starting to develop, and we've got a prototype of,

[David Southwick] 11:21:27
let's see, a benchmark. I say benchmark kind of in quotes, because it's not benchmarking a compute unit, but testing the shared file system service,

[David Southwick] 11:21:37
and then from there giving you some feedback on both your workload and, let's say, how many nodes you could scale that up to before it starts locking up the file system in some way.

[David Southwick] 11:21:53
So that's what's new with us, and a little bit of a peek into what

[David Southwick] 11:21:57
we're doing and where we're going, and I'm happy to answer any questions.
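
(Purely to illustrate the idea of the shared-filesystem probe described above, and not the actual prototype, a toy version might just time a streaming write and read against the shared mount; the path and transfer size here are placeholders.)

```python
# Toy illustration of an I/O probe on a shared file system: time one
# streaming write and one read and report throughput. The real prototype
# does much more; the path and size below are placeholders.
import os
import time

def probe(path="/shared/scratch/iotest.bin", mb=256):
    block = os.urandom(1024 * 1024)
    t0 = time.time()
    with open(path, "wb") as f:
        for _ in range(mb):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())
    write_mbps = mb / (time.time() - t0)

    t0 = time.time()
    with open(path, "rb") as f:
        while f.read(1024 * 1024):
            pass
    read_mbps = mb / (time.time() - t0)

    os.remove(path)
    return write_mbps, read_mbps

w, r = probe()
print(f"write {w:.0f} MB/s, read {r:.0f} MB/s")
```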

      [Enrico Fermi Institute] 11:22:06
      Okay, David. Thank you very much. So we have a couple of hands raised.

      [Enrico Fermi Institute] 11:22:10
Follow up? Go ahead.

      [Paolo Calafiura (he)] 11:22:12
Good morning, everyone. So, the first: you said that there are no industry-standard ML

[Paolo Calafiura (he)] 11:22:23
benchmarks, and I think that is still accurate.

[Paolo Calafiura (he)] 11:22:26
But I want to be sure you guys are aware of

[Paolo Calafiura (he)] 11:22:30
MLPerf, which is becoming one.

[Paolo Calafiura (he)] 11:22:36
So, okay, that was the first comment.

      [David Southwick] 11:22:36
      yeah.

      [Paolo Calafiura (he)] 11:22:41
      The second comment is that the and I mean, I I You guys have a very difficult job, because what we are seeing in Tc.

      [Paolo Calafiura (he)] 11:22:52
      Is that the same software? I mean the same out of it around on different software, on different platform, performs quite differently.

      [Paolo Calafiura (he)] 11:23:04
      So, if you have, a if you have a fast parameter or simulation, we should run that with alpaca, or with the kuda, or with the of well, cool, that is a different problem without Packer. With caucus, you'd get different different performance, of the same code in

      [Enrico Fermi Institute] 11:23:13
      Hmm.

      [Paolo Calafiura (he)] 11:23:22
      principle on day for different machines depending. What what portability layers to probability to you.

      [Paolo Calafiura (he)] 11:23:31
      So I wanted to ask you if you have a settled on a platform, for like parallelization platforms, but to use, or if you are taking the world, you know you're taking a mix so what what's your problem

      [David Southwick] 11:23:45
So I don't think it's settled at the moment, and of course this is a popular question: what sort of optimization targets are you using for your workloads?

[David Southwick] 11:23:56
And I mean not just for

[David Southwick] 11:24:01
translating across architectures, but even within the same families of units.

[David Southwick] 11:24:09
So at the moment, I think the method we have is to take the minimum compatibility, so that we don't have, you know, 10-20 different versions of the workloads. We want to have the same thing that we can run everywhere and sort of take all of these variables out of the

[Enrico Fermi Institute] 11:24:18
Okay.

[David Southwick] 11:24:30
equation. But, that being said, I don't think there's a clear answer on the proper way to do that yet, and it's not,

[David Southwick] 11:24:42
let's say, settled.

      [Enrico Fermi Institute] 11:24:48
Okay. Steve has his hand raised.

[Steven Timm] 11:24:51
Yeah, I was just wondering if the toolkit is available anywhere for download.

      [David Southwick] 11:24:58
Oh, good. Yeah, absolutely. So, let's say, I'll just go back;

[David Southwick] 11:25:04
I think I have a link to it. So the benchmarking suite itself, all of that is open source.

[David Southwick] 11:25:13
It's on GitLab at CERN, so /hep-benchmarks, and then the suite is the project down there.

      [Steven Timm] 11:25:15
      oh, okay, I see it. Okay.

      [David Southwick] 11:25:22
I have the link on the screen. Yeah, the other benchmark I was talking about, for, let's say, services,

[David Southwick] 11:25:30
this IO benchmark: I don't have the link on there, but I can share it afterwards.

[David Southwick] 11:25:36
It's in a, that's right, prototype state right now;

[David Southwick] 11:25:40
we have been working on it this year. It doesn't cover all the things you can throw at it yet,

[David Southwick] 11:25:47
but yeah, you can download it and play around with it.

[David Southwick] 11:25:51
And I should mention that the idea for this really is from... I don't know if he's in the room now, but I've seen him in the previous days, so shout-out to

[David Southwick] 11:26:04
him, I guess.

      [Enrico Fermi Institute] 11:26:07
      Okay, Okay, go ahead.

      [Steven Timm] 11:26:10
Hello! So that benchmark is no different

[Steven Timm] 11:26:14
from the one they were running on regular nodes,

[Steven Timm] 11:26:17
then?

[David Southwick] 11:26:18
Yeah, yep, exactly.

[Steven Timm] 11:26:20
Okay, good. And then there may be a new special one?

      [David Southwick] 11:26:25
Well, yeah, like I said, there was a workshop last week on this, discussing,

[David Southwick] 11:26:33
you know, how to choose the final versions and the weighting and whatnot.

[David Southwick] 11:26:37
So there are more qualified people around, I think, to answer specific questions on that.

[David Southwick] 11:26:43
But it's in progress. Yeah,

[Enrico Fermi Institute] 11:26:47
Okay.

[David Southwick] 11:26:48
there will be another version, I guess, with the, let's say, gold standard for the benchmark suite,

[David Southwick] 11:26:55
once that's decided.

      [Enrico Fermi Institute] 11:27:02
      Sure.

      [Dirk Hufnagel] 11:27:04
Yeah, I just had a quick question. You said you're benchmarking CPU-plus-GPU and also GPU-only workloads.

[Dirk Hufnagel] 11:27:13
Now, I mean, HEPscore and HEPSpec06 on CPU, that's well established:

[Dirk Hufnagel] 11:27:20
you take a mix of experiment-specific workloads, average something, throw something together, and get some average.

[David Southwick] 11:27:26
Yep.

[Dirk Hufnagel] 11:27:26
What do you do for the GPU stuff?

[Dirk Hufnagel] 11:27:30
Because it's so early with the experiments' algorithms;

[Dirk Hufnagel] 11:27:34
I know CMS has something, but it's not complete,

[Dirk Hufnagel] 11:27:36
it's not a complete picture. Do you run synthetic stuff, or do you run the very early stuff,

[Dirk Hufnagel] 11:27:42
because that's the only thing you can do?

      [David Southwick] 11:27:43
Yeah, so we're running very early stuff from CMS. There's MLPF,

[David Southwick] 11:27:50
which is a bit of an early bird; yeah, there was a talk yesterday on that.

[David Southwick] 11:27:56
We also are using HLT, and then as well the sort of rolling builds from, for example, MadGraph.

[Dirk Hufnagel] 11:28:11
Okay, thanks.

[David Southwick] 11:28:13
I guess there are also some other exotic GPU workloads, but these are from the Beams Department;

[David Southwick] 11:28:21
so there's this simpletrack, or...

[David Southwick] 11:28:26
I know Patatrack is in there as well, but I don't think we have a container for Patatrack.

      [Dirk Hufnagel] 11:28:31
So it's early going, and the numbers you get might not necessarily be representative of whatever we end up running in production later?

      [David Southwick] 11:28:40
      Exactly so. I mean, there's a lot of results results already, since we you know, we use the the suite as a reporting tool as well.

      [David Southwick] 11:28:48
      So it's it all gets pushed up over Amq into Cabana, and then you can All the workloads are hashed so you can compared performance across every node that it's run on with the same version of the world Okay, so at least with.

      [David Southwick] 11:29:04
      You know that published version? If you have your own bill, let's say you can track You can compare device to device like this, but you're right.

      [David Southwick] 11:29:13
      I mean, these are really, or let's say, snapshot releases of some of these.

      [David Southwick] 11:29:19
      So they will change whenever it's decided that that's going to be a production.

      [David Southwick] 11:29:25
      Or let's say, a a final version validated in some way.

      [David Southwick] 11:29:28
      Yeah.

      [Paolo Calafiura (he)] 11:29:29
So, jumping in: in CCE we have assembled what we think is a cross-section of representative applications, and of course we have no

[Paolo Calafiura (he)] 11:29:42
standing, it's just from asking around. I wonder if we should compare notes, and we can see if we picked the same ones, and under which configurations. Maybe we should have an offline discussion between our groups.

[David Southwick] 11:29:54
Sure. Are you using workloads, or, like, off-the-shelf benchmarks?

[Paolo Calafiura (he)] 11:30:02
No, no, we're using HEP workloads,

[David Southwick] 11:30:06
Yeah.

[Paolo Calafiura (he)] 11:30:08
so simulation, tracking. We're not doing machine learning workloads yet,

[Paolo Calafiura (he)] 11:30:11
so that's something that's missing.

[Enrico Fermi Institute] 11:30:13
Okay.

[Paolo Calafiura (he)] 11:30:14
But we should compare notes, also because of this dimension:

[Paolo Calafiura (he)] 11:30:18
it's not only the workload, it's the software platform you use, which makes one configuration different from another.

[Paolo Calafiura (he)] 11:30:28
Anyway, I'll shut up.

      [David Southwick] 11:30:30
Yeah, no, we can connect offline.

[Enrico Fermi Institute] 11:30:35
A quick question: how long do these benchmarks take to run?

[David Southwick] 11:30:38
So some of the GPU ones can be fast, on the order of, I don't know, 20 to 60 min. Some of the CPU ones are much longer, depending on what the experiment code owners have put forward

[David Southwick] 11:30:58
as a representative set. So I think the default block, for CPU only, in the current release candidate is something like 4 to 6 h.

[David Southwick] 11:31:13
But that's many workloads run back to back, and you run three iterations of each to get an average and get rid of outliers. I'm not sure what that will look like for GPU, because all the workloads I talked about for HPC that are

[David Southwick] 11:31:32
not sort of Run 3 standard ones are optional things that you can elect to run with the suite; they're not included by default.

[David Southwick] 11:31:45
They are available, you just have to use a little bit different configuration, which is included in the

      [Enrico Fermi Institute] 11:31:53
Yeah, I guess the thing I'm wondering, and maybe this is just a broader general question for everybody here: is this the kind of thing that we want to start incorporating

[Enrico Fermi Institute] 11:32:02
into the integration process for HPCs? Right, we go to, you know,

[Enrico Fermi Institute] 11:32:07
stand up Perlmutter, and then the next machine; we should make sure

[Enrico Fermi Institute] 11:32:11
we run these benchmarks as part of that integration,

[Enrico Fermi Institute] 11:32:13
so we start getting the benchmark numbers in place, and then maybe that helps eventually with the pledging and that sort of thing.

      [Dirk Hufnagel] 11:32:23
I can just say what we're doing right now: even though they are opportunistic, we still at the end of the year compile the usage in HEPSpec06, just to have a comparison and to see what the big picture looks like, and we just went through

[Dirk Hufnagel] 11:32:40
that exercise in '21 for all the HPCs

[Dirk Hufnagel] 11:32:44
that we're using in the US. And basically what I'm doing right now:

[Dirk Hufnagel] 11:32:47
I look at the CPU and compare it to what

[Dirk Hufnagel] 11:32:49
others have benchmarked, and usually you'll find a number where you can come up with a defensible

[Enrico Fermi Institute] 11:32:56
The

[Dirk Hufnagel] 11:32:58
HEPSpec06. But, I mean, especially if we get to really pledging the resources, then it becomes relevant;

[Dirk Hufnagel] 11:33:07
I think we need to run the benchmarks.

[Dirk Hufnagel] 11:33:09
Maybe not right now, but once we get to that point I think we should, to get a better number.

      [Enrico Fermi Institute] 11:33:17
Yeah, I mean, it seems like we ought to have a plan for it, even if the content of the benchmarks themselves changes over time, just to have that kind of in our minds and in the pipeline for when we integrate these resources.

      [Dirk Hufnagel] 11:33:17
      Okay.

      [Dirk Hufnagel] 11:33:30
We have a couple of raised hands, Lincoln. Maybe we should get these comments.

[Enrico Fermi Institute] 11:33:34
Yeah, okay. Andrew has his hand up.

      [Andrew Melo] 11:33:39
Yeah, I was just going to point out, I mean, Dirk talked a little bit earlier about, you know,

[Andrew Melo] 11:33:45
how do you account for the GPUs in the HEPscore,

[Andrew Melo] 11:33:52
and, you know, he suggested maybe you give like a 20% bonus, or something like that.

[Andrew Melo] 11:33:55
I think what makes sense, and I argued this at the HEPscore meeting last week, is that you can't really benchmark machines with just one single scalar anymore.

[Enrico Fermi Institute] 11:34:00
Cool.

[Andrew Melo] 11:34:04
So I think it's just going to have to be some sort of tuple, you know, per machine that has these different accelerators on it.

[Andrew Melo] 11:34:11
And I was going to also point out that, while we're working on, I guess you can call it HEPscore22, with Run 3 workloads, the number that pops out of HEPscore right now is a weighted

[Andrew Melo] 11:34:26
average. So, at least initially, what will be pledged, you know, this HEPscore unit that will be pledged, will only take into account the CPU.
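
(Andrew's tuple-per-machine suggestion could be represented as something like the record below; the field names and numbers are invented for illustration only.)

```python
# Invented-for-illustration record: one score per resource class instead of
# a single machine-wide scalar. Only the CPU part would be pledged at first.
from dataclasses import dataclass, field

@dataclass
class MachineScore:
    cpu_hepscore: float                               # pledged initially
    accelerators: dict = field(default_factory=dict)  # e.g. per-GPU scores

node = MachineScore(cpu_hepscore=1800.0, accelerators={"T4": 900.0})
print(node)
```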

      [David Southwick] 11:34:29
      Okay.

      [David Southwick] 11:34:37
Thanks, Andrew, and I'd like to add on to that: since we do have this automated reporting, it gives you the JSON

[David Southwick] 11:34:47
of all of the workloads. Yes, it will give you a single, say, HEPscore value, but it also gives you the value of

[Enrico Fermi Institute] 11:34:49
Yes.

[David Southwick] 11:34:55
every workload. So if you're just interested in HLT, or whatever it is, you can get the number on that machine for that benchmark, and you can go compare just the benchmarks you're interested in. This is also already available.

[David Southwick] 11:35:14
So it is, I guess, in that way

[David Southwick] 11:35:20
a bit more fine-grained than what we had in the past.
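
(Pulling one workload's number out of such a report is then a few lines; the JSON field names below are guesses for illustration, not the documented report schema.)

```python
# Read a suite result JSON and print the aggregate plus per-workload scores.
# Field names here are illustrative guesses, not the documented schema.
import json

with open("result.json") as f:
    report = json.load(f)

print("aggregate:", report.get("hepscore"))
for workload, res in report.get("workloads", {}).items():
    print(workload, res.get("score"))
```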

      [Enrico Fermi Institute] 11:35:25
      Okay, question or comment from Ian

      [Ian Fisk] 11:35:29
Yeah, it was only that I wanted to second that I think it's valuable to be using the benchmarks as we begin to commission the multiple HPC sites. Also, I think, in addition to having the benchmark that tells you a number about how well they're

[Ian Fisk] 11:35:43
performing, I'm wondering if it also serves a purpose as sort of what we used to think of as the site availability tests, for some things. There's a diversity in the workflows, and if they all succeed and give reasonable numbers,

[Ian Fisk] 11:35:54
you also have a reasonable expectation that the site is pretty well configured.

[Ian Fisk] 11:35:58
Again.

      [Andrew Melo] 11:36:03
So a fun anecdote about that, Ian: we actually did see some of this, where someone was benchmarking a machine and, you know,

[Andrew Melo] 11:36:15
they knew what the HEPscore should be for that machine, and it was about half, or, you know, 75%, of what they were expecting, and it turns out that the cooling of that rack had failed and the machine was actually power throttling.

[Andrew Melo] 11:36:24
So it was something where, yeah, people were able to say, hey, this machine isn't working right, just from looking at these numbers.

      [Ian Fisk] 11:36:30
      alright.

      [David Southwick] 11:36:33
      Yep.

      [Ian Fisk] 11:36:33
Yeah, I think the other thing, as we commission, probably more applicable to HPC

[Ian Fisk] 11:36:40
than cloud, is that these machines are much more complicated;

[Ian Fisk] 11:36:44
they're not as sort of simple as a pile of essentially x86 servers.

[Ian Fisk] 11:36:48
They tend to have more complex services, whether cooling or interconnect or whatever, and so a more detailed set of benchmarks makes sense.

      [Enrico Fermi Institute] 11:36:59
      Okay.

      [Enrico Fermi Institute] 11:37:02
So, on one of your slides you mentioned that you do the uploading of all the results, and you also mentioned that you have some kind of batch uploader for poorly

[Enrico Fermi Institute] 11:37:13
connected workers. Does that mean that this would work on LCFs, where the workers don't have any, you know,

[Enrico Fermi Institute] 11:37:18
outbound connectivity? Would it just batch it and upload it from the login nodes?

[Enrico Fermi Institute] 11:37:21
Is that the idea?

      [David Southwick] 11:37:23
Exactly. You know, sites like this, that's not the most common, but they're not uncommon either;

[David Southwick] 11:37:30
there are several that we've been working with here in Europe that have a similar configuration. And normally the default case, the base case, is to run it on a single node, for vendors

[David Southwick] 11:37:43
or whatever it is, and when the runs are finished it'll compile the report and then send it over AMQ.

[David Southwick] 11:37:50
But if you don't have connectivity on the machine that you benchmark, then you can collect

[David Southwick] 11:37:59
these JSONs afterward and do a batch reporting, basically, yeah, from a gateway node.
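
(The batch-reporting mode described here amounts to gathering the per-node JSON reports on a connected login node and pushing them to the message queue. A rough sketch using the stomp.py client follows; the broker address, credentials, and destination are placeholders, and whether the real suite uses this exact transport is an assumption.)

```python
# Rough sketch of batch reporting from a gateway/login node: collect the
# JSON reports written by disconnected workers and send them over AMQ.
# Broker address, credentials, and destination below are placeholders.
import glob
import stomp

conn = stomp.Connection([("amq.example.org", 61613)])
conn.connect("user", "password", wait=True)

for path in glob.glob("/scratch/bmk-results/*.json"):
    with open(path) as f:
        conn.send(destination="/queue/benchmark-results", body=f.read())

conn.disconnect()
```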

      [Enrico Fermi Institute] 11:38:06
      Okay, great.

      [Enrico Fermi Institute] 11:38:10
      Other questions for David

      [Enrico Fermi Institute] 11:38:18
      Okay, Thank you, David.

      [David Southwick] 11:38:19
      Yep thanks.

      [Enrico Fermi Institute] 11:38:25
      The slides again

      [Dirk Hufnagel] 11:38:32
      yeah, so.

      [Dirk Hufnagel] 11:38:36
So when we look at that process: assume we get some benchmarks

[Dirk Hufnagel] 11:38:40
now, some defensible numbers, and we figure out how we're going to deal with the GPU problem and work out how we pledge this with WLCG.

[Dirk Hufnagel] 11:38:48
At that point accounting goes from nice-to-have to:

[Dirk Hufnagel] 11:38:55
actually, we have to justify what we're using, and show that we're actually fulfilling the pledge.

[Dirk Hufnagel] 11:39:01
At that point accounting becomes mandatory. Right now

[Dirk Hufnagel] 11:39:07
it's optional, because we want to know what we're using.

[Dirk Hufnagel] 11:39:10
But when the numbers start to matter, when we actually have a pledge in place, then we need to show that we actually deliver that pledge. And the current situation is that CMS made a push last year, made an effort last year, to get all the accounting data

[Dirk Hufnagel] 11:39:29
pushed to APEL. We still have some problems there with some sites where we're using multi-node jobs, where the system isn't quite aware

[Dirk Hufnagel] 11:39:38
that there are actually multiple nodes behind a job; it thinks it's one.

[Dirk Hufnagel] 11:39:41
But those are some technical difficulties we're working on, and in principle things are connected. ATLAS doesn't currently do

[Dirk Hufnagel] 11:39:51
this. Cloud usage is an open question. Fernando, you're not currently pushing your cloud usage data to APEL, right, or to any grid accounting portal?

      [Fernando Harald Barreiro Megino] 11:40:03
No, I'm not. This KAPEL that is written in the slides,

[Fernando Harald Barreiro Megino] 11:40:11
that's a solution that, for example, Ryan Taylor from the University of Victoria implemented, because he's using a similar model, using the cloud, his private cloud, and there he did this KAPEL

[Fernando Harald Barreiro Megino] 11:40:25
because he needed to push the resources. So there is some solution,

[Fernando Harald Barreiro Megino] 11:40:31
but I don't have experience with that, and I'm not using it at the moment.

[Fernando Harald Barreiro Megino] 11:40:35
And it's also not applicable to, for example, HPCs.

      [Dirk Hufnagel] 11:40:41
Okay. And on top of accounting there's also monitoring, like operational monitoring.

[Dirk Hufnagel] 11:40:49
We are doing this already, but depending on what integration method you pick for an HPC

[Dirk Hufnagel] 11:40:56
this can be tricky. For instance, in the US

      [Enrico Fermi Institute] 11:40:56
      Okay.

      [Dirk Hufnagel] 11:41:01
we basically overlay a logical site on top of each HPC facility; basically, internally in CMS, it's its own thing in the monitoring infrastructure

[Dirk Hufnagel] 11:41:16
and the assumptions that it's built on.

[Dirk Hufnagel] 11:41:19
But, for instance, in the Italian case, with the site extension to the Marconi-100 HPC, they chose a different model. It's a site extension, so basically everything is under the

[Dirk Hufnagel] 11:41:36
Tier-1 site umbrella, and then accounting can be a bit tricky, because you cannot use the site name as a dividing line between what resources are Tier-1 and what resources

[Dirk Hufnagel] 11:41:48
are on the HPC. And then you kind of have to look at sub-site identifiers that basically divide this further into sub-sites, and not all monitoring systems are geared to support that. We've done some work on that; it's not... yeah,

[Dirk Hufnagel] 11:42:10
this is the problem. And in cloud, so far at least, ATLAS is doing their own separate site in PanDA,

[Dirk Hufnagel] 11:42:21
so this problem doesn't come up. Also, the CMS scale tests basically overlay a separate site on the cloud resources.

[Dirk Hufnagel] 11:42:29
But you could also imagine, if you do a seamless extension, like if a Tier-2 decides that they want to support extension of their batch resources into the cloud, that issue will also come

[Dirk Hufnagel] 11:42:40
up. I mean, if they want to do separate accounting, that is.
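
(The sub-site problem described here is essentially that one site-name key is not enough. A hypothetical accounting record with an explicit sub-site field, sketched below with invented values, is the kind of thing monitoring systems would need to support.)

```python
# Hypothetical accounting record: the plain site name cannot separate the
# Tier-1 proper from its HPC extension, so an explicit sub-site field is
# added. All values are invented for illustration.
record = {
    "site": "T1_IT_CNAF",
    "subsite": "Marconi100",      # HPC extension behind the T1 umbrella
    "wall_hours": 12_000,
    "cores": 32,
    "hs06_per_core": 14.2,
}
record["hs06_hours"] = record["wall_hours"] * record["cores"] * record["hs06_per_core"]
print(record["hs06_hours"])
```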

      [Enrico Fermi Institute] 11:42:46
      Okay.

      [Dirk Hufnagel] 11:42:47
We don't have anyone from OSG here, so I don't think we can get any comment on that.

      [Enrico Fermi Institute] 11:42:53
      And

      [Enrico Fermi Institute] 11:43:00
Any other comments about accounting? Otherwise we move on.

      [Dirk Hufnagel] 11:43:14
Okay. Now we actually have something about pledging. We already talked about that in the last two days, and I don't want to rehash that discussion here; we talked about the difference between AC and DC,

      [Enrico Fermi Institute] 11:43:18
      Yes.

      [Dirk Hufnagel] 11:43:31
and planning, like capacity, integrated capacity and instantaneous capacity, and the problems related to that, and how the scheduling of HPC and cloud impacts that. Also, one thing to note is that HPC and cloud resources

[Dirk Hufnagel] 11:43:54
are not official OSG and EGI sites, so we don't,

[Dirk Hufnagel] 11:43:58
we don't get GGUS tickets. We talked about the GGUS

[Dirk Hufnagel] 11:44:02
tickets already. And we talked about the cost to support these resources: you need to set up some unit that supports them.

[Enrico Fermi Institute] 11:44:06
Okay.

[Dirk Hufnagel] 11:44:10
For instance, CMS has that team at Fermilab. So anytime there's a problem at the US

[Dirk Hufnagel] 11:44:16
HPC sites, we get a ticket for that. And the question here is, where do we see this going,

[Dirk Hufnagel] 11:44:22
like, maybe not next year, maybe not even 2 years,

[Dirk Hufnagel] 11:44:27
but where do we want to be in 5 years?

[Dirk Hufnagel] 11:44:32
Let's say, just before the HL-LHC

[Dirk Hufnagel] 11:44:35
starts up, what's the goal here? And then,

[Dirk Hufnagel] 11:44:40
this requires discussion with WLCG.

      [Douglas Benjamin] 11:44:48
In both cases we don't own the clouds or the HPCs, right?

      [Dirk Hufnagel] 11:44:54
Yeah, we don't own them, but we basically, what,

[Douglas Benjamin] 11:44:59
Therefore we are customers of them.

[Dirk Hufnagel] 11:45:02
we're leasing them? I mean, in some sense it doesn't matter who actually owns the hardware;

[Enrico Fermi Institute] 11:45:05
Okay.

[Dirk Hufnagel] 11:45:08
it matters that you get guaranteed access in some way.

      [Douglas Benjamin] 11:45:13
      Not necessarily.

      [Douglas Benjamin] 11:45:17
      Right. We are a customer

      [Dirk Hufnagel] 11:45:20
We are customers, correct.

      [Douglas Benjamin] 11:45:22
So we have to deal with the interface layer, that is, our community's, right? I.e.,

[Dirk Hufnagel] 11:45:27
But that's support; that goes into support. Since we are the customer, we have to be the middleman for supporting the resources.

[Douglas Benjamin] 11:45:27
GGUS.

[Douglas Benjamin] 11:45:38
And then the pledging, right? But the pledging then comes from money that we get to provide the compute

[Dirk Hufnagel] 11:45:38
So we are the interface to the experiment.

[Douglas Benjamin] 11:45:47
that we have to provide to ATLAS.

      [Dirk Hufnagel] 11:45:49
Yeah, the fundamental difference is that the entity that owns the resources doesn't have a relationship with WLCG;

[Dirk Hufnagel] 11:45:59
they basically have only a relationship with us, and then we have a relationship with WLCG.

      [Douglas Benjamin] 11:46:09
And do you expect that to change in 5 years? I don't.

[Douglas Benjamin] 11:46:15
Both clouds and HPCs serve different...

[Douglas Benjamin] 11:46:18
The sites serve different masters. They have different, you know, the HPCs in the US

[Douglas Benjamin] 11:46:26
are responsible to NSF and DOE.

[Douglas Benjamin] 11:46:33
Right? Am I missing something?

      [Steven Timm] 11:46:34
      Okay? Awesome.

      [Dirk Hufnagel] 11:46:35
No. But why do we need to care? Because if we get an allocation for, like, a 100 million hours, that's something we can use.

[Dirk Hufnagel] 11:46:45
We'd have no guarantees when we can use it,

[Dirk Hufnagel] 11:46:48
but we get something, and we'll have this over a period of time.

      [Steven Timm] 11:46:50
      And

      [Steven Timm] 11:46:56
And at least in the US we know, from the funding side, this is the way we're going:

[Steven Timm] 11:47:00
they're not going to be funding the level of lab-owned computers that they have, right?

[Steven Timm] 11:47:06
So we're ready to evolve beyond resources that they own and operate on a regular basis

[Steven Timm] 11:47:13
to the ones that they don't own. I mean, smaller experiments have been doing this forever, just running basically everywhere

[Steven Timm] 11:47:24
they can get an account.

      [Dirk Hufnagel] 11:47:27
Paolo, you had a comment on this discussion?

      [Paolo Calafiura (he)] 11:47:29
Yes, just to say that I don't think it's particularly productive to discuss, you know, who owns what and what the agencies' long-

[Paolo Calafiura (he)] 11:47:46
term plan is. We need to be ready for what they are telling us now, which is that they want us to use their services. But that wasn't the reason I raised my hand. I wanted to bring up

[Paolo Calafiura (he)] 11:48:00
another angle, and I don't know if this is just an ATLAS thing, or if you guys have a similar concept.

[Paolo Calafiura (he)] 11:48:05
And this has to do with this capacity versus power, or work versus power, that we have been discussing.

[Paolo Calafiura (he)] 11:48:15
So in ATLAS we have the concept of pledged and beyond-pledge resources, and I don't even want to try to tell you why we have this distinction; it is historical.

[Paolo Calafiura (he)] 11:48:31
But the reality is that the pledge in ATLAS is sufficient to process data, more than sufficient to process data and to produce a minimum of simulation, and it would probably be sufficient to analyze data. But we do rely on a very substantial amount of beyond-

[Paolo Calafiura (he)] 11:48:54
pledge resources, which is taken into account, is measured, and in the end is not technically pledged. So there is no distinction in my mind

[Paolo Calafiura (he)] 11:49:07
between a Tier-2 delivering twice as many resources as they are supposed to, and an HPC

[Paolo Calafiura (he)] 11:49:16
delivering the same resources. So the question is: is this concept of pledge absolutely fundamental?

[Paolo Calafiura (he)] 11:49:27
Or is it something which is, you know, we're stuck with it, we deal with it, and then we treat the HPCs and, you know, any resource which can deliver resources not on a constant basis

[Paolo Calafiura (he)] 11:49:42
but in an opportunistic way, just as beyond-pledge?

      [Dirk Hufnagel] 11:49:50
      I mean, we're doing this now; that's how we're treating HPC. Now, the question is going forward:

      [Dirk Hufnagel] 11:49:58
      if we manage, let's say we manage to get the LCFs

      [Dirk Hufnagel] 11:50:04
      working, and we can run at really super large scale, is this still going to be okay?

      [Dirk Hufnagel] 11:50:15
      Because at that point it's going to be, it

      [Dirk Hufnagel] 11:50:18
      might be, a much larger fraction of overall resources. Is this beyond-pledge model still working then, in that case?

      [Paolo Calafiura (he)] 11:50:22
      Hmm.

      [Paolo Calafiura (he)] 11:50:29
      I would say we'll cross that bridge when we come to it.

      [Paolo Calafiura (he)] 11:50:33
      But, you know, in the end, I guess what I'm saying is that I'm not sure it is a particularly important distinction right now,

      [Paolo Calafiura (he)] 11:50:44
      whether resources are pledged or not pledged. That's, I guess, what I'm trying to say.

      [Paolo Calafiura (he)] 11:50:52
      And yeah, you are right: if we end up with 75% of our resources being non-pledged, then it's weird.

      [Dirk Hufnagel] 11:51:01
      I mean, Brian also made a good argument on Monday: so far we're looking at this from our viewpoint, but at some point it might become a problem for the agencies, because they want credit for it. So that could become an issue.

      [Paolo Calafiura (he)] 11:51:21
      Well, we do; I'm sure you do the same. At least in ATLAS we do:

      [Paolo Calafiura (he)] 11:51:24
      we do acknowledge that contribution.

      [Dirk Hufnagel] 11:51:26
      Yes, we report it, but as far as WLCG

      [Dirk Hufnagel] 11:51:31
      is concerned they're, I don't know, second-class resources. So I don't know how much they matter to the agency.

      [Paolo Calafiura (he)] 11:51:37
      Well, I think it's far. It's part of one of these is concerned that Wcg.

      [Paolo Calafiura (he)] 11:51:42
      Does a monster

      [Paolo Calafiura (he)] 11:51:47
      yes.

      [Dirk Hufnagel] 11:51:58
      What about the computing plans?

      [Douglas Benjamin] 11:52:01
      Can I ask the question another way? In the next 5 years,

      [Douglas Benjamin] 11:52:04
      will we need these to meet our pledge, given

      [Douglas Benjamin] 11:52:10
      our current, sort of, flat funding?

      [Douglas Benjamin] 11:52:17
      Because you said a 5-year timeline.

      [Dirk Hufnagel] 11:52:20
      Yeah, I had a similar thought, a question, basically: pledge means something in terms of what you can plan.

      [Dirk Hufnagel] 11:52:30
      Right? I mean, you base your planning on pledge, and beyond-pledge is something extra you can add.

      [Enrico Fermi Institute] 11:52:30
      Okay.

      [Dirk Hufnagel] 11:52:37
      So

      [Dirk Hufnagel] 11:52:41
      If that extra becomes required for what you need to do as a baseline, doesn't it need to be included in the pledge?

      [Enrico Fermi Institute] 11:53:02
      okay.

      [Dirk Hufnagel] 11:53:11
      I guess no one has an answer for that.

      [Enrico Fermi Institute] 11:53:19
      Just need more time to think about

      [Dirk Hufnagel] 11:53:21
      Yeah, I mean, this is future.

      [Dirk Hufnagel] 11:53:26
      I think these are questions that should go into the report.

      [Dirk Hufnagel] 11:53:30
      But it's anyway outside the scope of this workshop, and maybe even of this report in general, these kinds of discussions on this topic.

      [Douglas Benjamin] 11:53:42
      But how much labor do we want for beyond-pledge activity? How much labor is acceptable versus excessive?

      [Douglas Benjamin] 11:53:57
      In other words, if it takes 3 FTEs to do 3% of the Monte Carlo ATLAS needs as the US contribution, then you might consider that excessive.

      [Dirk Hufnagel] 11:54:10
      Oli, you wanna weigh in on this?

      [Oliver Gutsche] 11:54:13
      Well, let me try. So I think the agencies are also trying to optimize their budget.

      [Oliver Gutsche] 11:54:24
      Right? So in the end the agencies need to enable us to do our science.

      [Oliver Gutsche] 11:54:30
      So if the agencies have the possibility to say, okay, instead of giving you all the money to run all your sites, some of your processing will come from reliable allocations on HPC, then the question is: do those fulfill the requirements to be

      [Oliver Gutsche] 11:54:52
      acknowledged as an official contribution to the experiment? And then it becomes a cost question, right, as you asked: how many FTEs is reasonable?

      [Oliver Gutsche] 11:55:05
      It depends, then, on how much money, how much funding, you would actually save by this approach.

      [Oliver Gutsche] 11:55:12
      So I think the question that we might have to answer is: how much does it cost us to pledge HPC

      [Oliver Gutsche] 11:55:23
      resources? And that then goes into the calculation, for us

      [Oliver Gutsche] 11:55:30
      and the agencies, of how efficient it is to actually pledge HPC resources for our purposes.

      [Enrico Fermi Institute] 11:55:32
      Okay.

      [Oliver Gutsche] 11:55:39
      So that would be very interesting assessment for us.

      [Enrico Fermi Institute] 11:55:45
      Well, and and all of the

      [Oliver Gutsche] 11:55:46
      I don't know if I make Yeah, sure. Sorry. Go ahead.

      [Enrico Fermi Institute] 11:55:50
      No, I was just gonna say that the other half of that is, and you're right that we will have to answer at some point the question of how much it costs to provide X resources from HPCs, but then we also need to be ready for the immediate other question, which

      [Enrico Fermi Institute] 11:56:03
      is going to be: how much does it cost for us to provide those same resources on premises?

      [Enrico Fermi Institute] 11:56:09
      I guess right.

      [Oliver Gutsche] 11:56:11
      Yeah, but I mean, for the latter question, we have 15 years of experience doing that, right?

      [Oliver Gutsche] 11:56:21
      So for me, for this exercise,

      [Oliver Gutsche] 11:56:26
      it's really about: if we want HPC resources or commercial cloud resources to replace pledges that we normally would provide through our sites,

      [Oliver Gutsche] 11:56:37
      what would it actually cost?

      [Paolo Calafiura (he)] 11:56:51
      One thing which I guess I can add to this, to clarify what I was saying before, is that I think, and that's why the new benchmarking is so important,

      [Paolo Calafiura (he)] 11:57:06
      what we need right now is a reliable way to do accounting.

      [Paolo Calafiura (he)] 11:57:13
      And so, for example, we can then answer the question Doug was asking: is it worth having

      [Paolo Calafiura (he)] 11:57:18
      3 FTEs to get 3% of the overall resources? I don't want to use the words pledged or non-pledged here; forget about that.

      [Enrico Fermi Institute] 11:57:27
      okay.

      [Paolo Calafiura (he)] 11:57:28
      If a site is giving me 3% of the resources, how much effort, and therefore money, do I need to

      [Paolo Calafiura (he)] 11:57:38
      put into it, and is it worth it? So I think the problem is that right now we do not...

      [Paolo Calafiura (he)] 11:57:46
      Well, we also said the problem is that we do not,

      [Paolo Calafiura (he)] 11:57:52
      we do not have basically any workflows running on LCFs, because we don't have any accelerated workloads really in production.

      [Paolo Calafiura (he)] 11:57:57
      But assuming we do add them, then we need,

      [Paolo Calafiura (he)] 11:58:01
      then we need an agreed way to measure what the contribution is.

      [Paolo Calafiura (he)] 11:58:08
      It could be the number of events. But then, of what? What kind of events?

      [Paolo Calafiura (he)] 11:58:15
      Is it full simulation, fast simulation, reconstruction?

      [Paolo Calafiura (he)] 11:58:19
      So for me the critical question is the accounting, not the pledging.

      [Paolo Calafiura (he)] 11:58:24
      Okay, just to reformulate what I was saying before.
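
      (Editor's sketch: a minimal illustration of the event-based accounting Paolo describes, turning delivered events of different kinds into one normalized number. The workflow names and per-event HS06-second costs are hypothetical placeholders, not measured values.)

          # Sketch: convert delivered events into normalized compute credit.
          # Per-event reference costs (HS06*s/event) are hypothetical; real
          # values would come from benchmarked production workloads.
          REFERENCE_COST_HS06_SEC = {
              "full_simulation": 300.0,
              "fast_simulation": 30.0,
              "reconstruction": 120.0,
          }

          def delivered_hs06_hours(events_by_workflow):
              """Weight event counts by per-event cost; return HS06-hours."""
              total = sum(n * REFERENCE_COST_HS06_SEC[wf]
                          for wf, n in events_by_workflow.items())
              return total / 3600.0

          # Example: credit for a site that ran a mix of workflow types.
          print(delivered_hs06_hours({"full_simulation": 1_000_000,
                                      "reconstruction": 200_000}))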

      [Enrico Fermi Institute] 11:58:40
      okay.

      [Dirk Hufnagel] 11:58:46
      Okay, do we want to move on? I mean, we have many questions, but since it concerns the future, that's expected.

      [Enrico Fermi Institute] 11:58:49
      Yeah.

      [Enrico Fermi Institute] 11:58:55
      Yeah, so the next slide was just a placeholder for that.

      [Dirk Hufnagel] 11:59:02
      And then another question from the charge was: what new facility features or policies would help US

      [Dirk Hufnagel] 11:59:09
      ATLAS and US CMS adopt commercial cloud and HPC

      [Dirk Hufnagel] 11:59:13
      resources? One thing that we had in here was security. We invited some security folks, but I don't think anyone actually managed to connect.

      [Dirk Hufnagel] 11:59:24
      So that's a little bit of a pity. But one big problem.

      [Dirk Hufnagel] 11:59:31
      Well, apart from the LCF restriction of no outbound Internet from the worker nodes, most of the HPCs these days also have some sort of MFA login procedure, so you cannot really connect from the outside to these systems without going through some

      [Enrico Fermi Institute] 11:59:37
      Yeah.

      [Dirk Hufnagel] 11:59:52
      MFA process, and that usually would mean that we cannot really integrate things into automated provisioning systems and things like that.

      [Dirk Hufnagel] 12:00:01
      But some HPCs are a bit more flexible about MFA,

      [Dirk Hufnagel] 12:00:06
      in what they allow MFA to mean. At the LCFs

      [Enrico Fermi Institute] 12:00:08
      Yeah.

      [Dirk Hufnagel] 12:00:12
      it's basically strictly hardware tokens or phone apps, so you can't do anything unless any kind of outside connection goes through that step.

      [Dirk Hufnagel] 12:00:20
      So it cannot be automated. The NSF-funded HPCs are

      [Dirk Hufnagel] 12:00:26
      so far at least more forgiving. They can say, okay,

      [Dirk Hufnagel] 12:00:30
      MFA can mean that the system that logs in remotely comes from a certain IP, or things like that, or they allow MFA to be bypassed. At the moment it's still, in general, a policy question. And then, Fernando, you wanted to say something on

      [Dirk Hufnagel] 12:00:47
      the cloud issues with the CAs?

      [Fernando Harald Barreiro Megino] 12:00:51
      Yeah. So the problem is that Google and Amazon,

      [Fernando Harald Barreiro Megino] 12:01:01
      they use their own certificate authorities, and those are not trusted by IGTF.

      [Fernando Harald Barreiro Megino] 12:01:11
      And in particular, if you want to do a third-party transfer, it means, from a grid site,

      [Fernando Harald Barreiro Megino] 12:01:15
      the cloud CA is not trusted, and then the transfer fails, and you need to do something:

      [Fernando Harald Barreiro Megino] 12:01:23
      put something in front of the storage with another certificate, and this

      [Fernando Harald Barreiro Megino] 12:01:34
      can become a bottleneck. It would be preferable if the third-party transfers just worked. This is being discussed already in the WLCG,

      [Fernando Harald Barreiro Megino] 12:01:47
      but I heard that probably there will not be a solution in the near

      [Fernando Harald Barreiro Megino] 12:01:55
      term, so this is a medium- to long-term problem.
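
      (Editor's sketch: why a client that only trusts IGTF CAs rejects cloud object stores. The endpoint URL and CA-bundle path below are hypothetical examples, not real endpoints.)

          # Sketch: TLS verification against an IGTF-only CA bundle.
          import requests

          IGTF_BUNDLE = "/etc/grid-security/igtf-ca-bundle.pem"  # assumed path

          def can_verify(url):
              try:
                  requests.head(url, verify=IGTF_BUNDLE, timeout=10)
                  return True
              except requests.exceptions.SSLError:
                  # Server cert chains to a CA outside the IGTF bundle,
                  # which is what happens with cloud-provider CAs today.
                  return False

          # A gateway with an IGTF-trusted certificate in front of the bucket
          # is the workaround described above, at the price of a bottleneck.
          print(can_verify("https://example-bucket.storage.example.com/"))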

      [Enrico Fermi Institute] 12:02:02
      So is the issue that the WLCG

      [Enrico Fermi Institute] 12:02:05
      would need to accept the certificate authorities of the commercial cloud providers?

      [Fernando Harald Barreiro Megino] 12:02:12
      Yes, they would have to become part of the IGTF, and I don't know exactly what the policies are to get into the IGTF, and I understand it also requires some effort from the cloud vendor. For them

      [Fernando Harald Barreiro Megino] 12:02:31
      maybe it's not worth it. So that's why

      [Fernando Harald Barreiro Megino] 12:02:37
      there is not really a solution to this short-term, as far as I understand.

      [Enrico Fermi Institute] 12:02:43
      I'll just point out that the IGTF is bigger than just the WLCG.

      [Enrico Fermi Institute] 12:02:47
      So it's not necessarily WLCG that has to do it; it's a step above them that we would have to convince to do it.

      [Dirk Hufnagel] 12:03:02
      Oh, and I skipped the federated identity point, because I think without a security person from the lab here there's no point discussing it. I'll just mention what it is before I go to Eric: basically, the labs have been working on kind of

      [Dirk Hufnagel] 12:03:24
      federating the systems in terms of logins and so on,

      [Dirk Hufnagel] 12:03:28
      but I'm not sure if this will help, because as far as I know MFA is still required on top of that. So I might be able to log in to Argonne with my Fermilab ID, but I still would have to go through the MFA step, as far as I know. I just don't

      [Dirk Hufnagel] 12:03:44
      know if that's something that eventually could be dropped, with maybe private networks between the national labs.

      [Dirk Hufnagel] 12:03:51
      But that's where I wanted to get feedback from IT

      [Dirk Hufnagel] 12:03:54
      security. I guess we'll have to do that offline. Eric?

      [Eric Lancon] 12:03:58
      Yes, I wanted to comment; I have 2 points. So when it comes to MFA identification and cloud CAs: that's a price to pay when you don't own the resources.

      [Eric Lancon] 12:04:16
      So instead of complaining, we should find innovative solutions, whenever possible, to use those resources better. On the federated ID:

      [Eric Lancon] 12:04:30
      I invited the IT people online for this meeting,

      [Eric Lancon] 12:04:35
      so if there are no Fermilab people, we can have the BNL perspective; Jerome is on the line. I wanted to remind everyone that MFA will certainly become a standard everywhere.

      [Eric Lancon] 12:04:56
      Well, on a 5-year time scale. So we should adapt and foresee that we'll have to work with it.

      [Eric Lancon] 12:05:06
      So, on federated ID, Jerome, if you want to say a few words.

      [Jerome Lauret (he/him)] 12:05:10
      Sure. Let me right away jump on the MFA topic and close

      [Jerome Lauret (he/him)] 12:05:19
      this: in general, any services that we have put in place that are cloud-inspired, or provide access for a wide variety of people from multiple different organizations,

      [Jerome Lauret (he/him)] 12:05:35
      required MFA. There's been no escape from it. So far we have been able to set up,

      [Jerome Lauret (he/him)] 12:05:43
      of course, you know, a few services, JupyterHub and things like that,

      [Jerome Lauret (he/him)] 12:05:47
      but MFA has been essentially the prerequisite.

      [Jerome Lauret (he/him)] 12:05:53
      The other comment, on federated ID, is that of course it all depends on the federation, and we have been basically allowed to proceed with many of the trusted federations,

      [Jerome Lauret (he/him)] 12:06:10
      with what DOE sees as trusted. So that is, for example, people coming from all the national labs;

      [Jerome Lauret (he/him)] 12:06:18
      we have an exemption for certain federations as well.

      [Jerome Lauret (he/him)] 12:06:23
      But in general we have had a consistent message that Google, for example, is not a trusted and acceptable federation, and the reason for that is just that anybody, anytime, can impersonate anyone. And indeed, on our side, we saw some kind of funny identities. So

      [Jerome Lauret (he/him)] 12:06:47
      that's all I wanted to say. But of course, as you guys are already seeing, the fact that we are being told,

      [Jerome Lauret (he/him)] 12:06:55
      okay, please proceed with federated ID, is very encouraging. But this is a long road. And I think that early on someone mentioned also that, you know, MFA

      [Jerome Lauret (he/him)] 12:07:09
      or SFA could be bypassed in some ways.

      [Jerome Lauret (he/him)] 12:07:10
      Yes, that's true. And there is a lot of work here

      [Enrico Fermi Institute] 12:07:12
      Okay.

      [Jerome Lauret (he/him)] 12:07:13
      to also add trusted metadata, you know, as part of the certificate. But this is indeed not an immediate development.

      [Jerome Lauret (he/him)] 12:07:24
      So that's why, perhaps, right now we prefer to have only trusted federations, just to be sure that, you know, everybody has essentially the same kind of rules of engagement.

      [Enrico Fermi Institute] 12:07:40
      So one question I had is, from a security perspective:

      [Enrico Fermi Institute] 12:07:45
      is it the case that MFA fundamentally means that you want some human interaction to authenticate with a resource, or, you know, does MFA

      [Enrico Fermi Institute] 12:08:04
      just mean that you really need these multiple factors, right?

      [Enrico Fermi Institute] 12:08:09
      It's not sufficient just to have a key or password or whatever; you need to have some additional factor.

      [Jerome Lauret (he/him)] 12:08:15
      Good question. Give me a second. It's simple; in fact, you know, in some cases we even require a secret handshake.

      [Jerome Lauret (he/him)] 12:08:27
      I mean, essentially, you need to know the point of contact within the experiment you are in, in order to have an account and be approved. The level of confidence depends, of course, on the service that you access. Just to be sure you see the difference:

      [Enrico Fermi Institute] 12:08:30
      And

      [Enrico Fermi Institute] 12:08:35
      Sure.

      [Enrico Fermi Institute] 12:08:41
      Sure.

      [Jerome Lauret (he/him)] 12:08:46
      if you access Mattermost, for example, even your federated ID is enough.

      [Jerome Lauret (he/him)] 12:08:53
      If you are issued an account, something that essentially allows you to make modifications to content,

      [Jerome Lauret (he/him)] 12:09:02
      then MFA is required. And if you access computing resources right now, things where you can eventually launch a large number of jobs, the kind of thing that can appear as some kind of illegal activity, then not only is MFA required, but in order to have your account

      [Jerome Lauret (he/him)] 12:09:20
      approved you need some extra steps and verification; we came up with a procedure that was actually acceptable to the cyber team.

      [Jerome Lauret (he/him)] 12:09:29
      So, you know, there are some kind of levels of acceptance.

      [Enrico Fermi Institute] 12:09:35
      So I guess, then, is it fair to say that MFA

      [Enrico Fermi Institute] 12:09:40
      does not fundamentally, you know, exclude automation,

      [Jerome Lauret (he/him)] 12:09:47
      I would say...

      [Enrico Fermi Institute] 12:09:47
      related to what, for example, TACC has done, where they consider a trusted machine to be a factor on top of, you know, the key that you provide?

      [Jerome Lauret (he/him)] 12:10:01
      Right. So actually this is an excellent question, because, of course, job submission from a trusted host,

      [Jerome Lauret (he/him)] 12:10:11
      for example in OSG land, has of course been accepted, right? So...

      [Enrico Fermi Institute] 12:10:16
      Okay.

      [Jerome Lauret (he/him)] 12:10:17
      And that indeed is somewhat what you are hinting at: that host is trusted, and, you know, to access that host to submit,

      [Jerome Lauret (he/him)] 12:10:30
      you require additional, you know, authentication.

      [Jerome Lauret (he/him)] 12:10:35
      You understand what I'm saying, right? I mean, you log in, for example, to that host using your local credential,

      [Jerome Lauret (he/him)] 12:10:40
      then you eventually issue a token or whatever, which is yet a second factor then used to submit your job. And that has been accepted for quite a while.

      [Jerome Lauret (he/him)] 12:10:51
      So you are right that there may be some leeway there, some room in that sense.

      [Jerome Lauret (he/him)] 12:10:58
      Okay.

      [Enrico Fermi Institute] 12:11:00
      I think

      [Enrico Fermi Institute] 12:11:09
      Other comments or questions about MFA, or cloud CAs, IGTF, that sort of thing?

      [Dale Carder] 12:11:13
      I know at NERSC there's a process to get long-term keys instead of, like, the default

      [Dale Carder] 11:45:18
      24-hour key for the SSH proxy.

      [Dirk Hufnagel] 12:11:21
      Yeah, it's up to a month, I think, that they support.

      [Dale Carder] 12:11:24
      Yeah.

      [Dirk Hufnagel] 12:11:27
      But to get that key you need to go through an MFA process, and then you're okay

      [Enrico Fermi Institute] 12:11:34
      Sure.

      [Dirk Hufnagel] 12:11:34
      for 30 days. That's basically the compromise between

      [Dirk Hufnagel] 12:11:37
      not wanting to allow automation where no one authenticates for a couple of years, and then, if

      [Dirk Hufnagel] 12:11:42
      that key gets compromised, basically everyone can use it forever. And for us, at least, it makes it operationally feasible

      [Dirk Hufnagel] 12:11:53
      to use the system, even with the MFA rules

      [Dirk Hufnagel] 12:11:57
      in place.
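
      (Editor's sketch of the workflow being described, using NERSC's sshproxy service. The script name and flags follow NERSC's public documentation as we understand it; the "cms" scope, key path, and host are hypothetical examples.)

          # Sketch: one interactive MFA step buys a time-limited SSH key that
          # automation can then reuse until it expires.
          import subprocess

          # Interactive step (password + OTP satisfies MFA); writes a key
          # pair under ~/.ssh. The -s flag requests a longer-lived scoped key.
          subprocess.run(["./sshproxy.sh", "-u", "myuser", "-s", "cms"],
                         check=True)

          # Automated submission may then reuse the key (up to ~30 days).
          subprocess.run(["ssh", "-i", "/home/myuser/.ssh/cms",
                          "myuser@perlmutter.nersc.gov",
                          "sbatch", "pilot_launch.sh"],
                         check=True)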

      [Dirk Hufnagel] 12:12:01
      And I mean, Globus Online is the same. That's what we're doing with the Rucio-Globus Online

      [Dirk Hufnagel] 12:12:06
      integration for the transfers: someone actually has to log in manually to the portal and renew a key once a week so that the transfers can keep going. But, I mean, once a week, once a month, that just means you roll it into cost.

      [Enrico Fermi Institute] 12:12:15
      Right.

      [Dirk Hufnagel] 12:12:23
      It's a bump on the cost of operations for the long term.
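
      (Editor's sketch: what the scripted side of that integration can look like with the Globus Python SDK. The endpoint UUIDs, paths, and token are hypothetical; the access token is the piece the periodic manual, MFA-backed login renews.)

          # Sketch: third-party transfer submission via globus_sdk.
          import globus_sdk

          ACCESS_TOKEN = "..."  # refreshed by the weekly interactive login
          NERSC_EP = "aaaaaaaa-0000-0000-0000-000000000000"  # hypothetical
          THETA_EP = "bbbbbbbb-0000-0000-0000-000000000000"  # hypothetical

          tc = globus_sdk.TransferClient(
              authorizer=globus_sdk.AccessTokenAuthorizer(ACCESS_TOKEN))
          tdata = globus_sdk.TransferData(tc, NERSC_EP, THETA_EP,
                                          label="nersc-to-theta hop")
          tdata.add_item("/global/cfs/stage/file.root", "/theta/file.root")
          task = tc.submit_transfer(tdata)
          print("submitted task", task["task_id"])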

      [Jerome Lauret (he/him)] 12:12:29
      Usually, when you have those long-term credentials, you also have to demonstrate that you have a way to revoke them first.

      [Dirk Hufnagel] 12:12:41
      I don't know how NERSC would handle that. You would probably have to go through NERSC, because I don't think you can revoke it yourself.

      [Jerome Lauret (he/him)] 12:12:48
      Yeah, exactly. So this may be a concern in the long term.

      [Jerome Lauret (he/him)] 12:12:52
      Yes, I'm just saying, in terms of visibility: people may not know the details, but usually that's one of the things that comes up when long-lived credentials appear.

      [Robert Hancock] 12:13:05
      Yeah, and in our plan, with Vault, right? The long-standing credentials would stay in the Vault

      [Robert Hancock] 12:13:11
      server, so we could just delete them from there, and then they wouldn't be able to pull any more short-term credentials, you know,

      [Robert Hancock] 12:13:15
      short-term tokens, the access tokens.
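
      (Editor's sketch of the revocation idea Robert describes: long-lived credentials live only in Vault, so deleting them there cuts off any further short-lived tokens. The mount point and path are hypothetical, and a real deployment may store refresh tokens differently.)

          # Sketch: revoking a stored long-lived credential in HashiCorp Vault.
          import hvac

          client = hvac.Client(url="https://vault.example.org:8200",
                               token="s.admin-token")  # hypothetical token

          # Delete the stored long-lived credential for one robot identity;
          # afterwards it can no longer mint short-term access tokens.
          client.secrets.kv.v2.delete_metadata_and_all_versions(
              mount_point="secret",
              path="robots/transfer-service/refresh-token",
          )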

      [Enrico Fermi Institute] 12:13:27
      Okay.

      [Dirk Hufnagel] 12:13:31
      If there are no more comments on security topics, we could move on to allocations; that's more about acquiring the resources.

      [Enrico Fermi Institute] 12:13:42
      Okay.

      [Dirk Hufnagel] 12:13:45
      So on HPC you do it through what are currently yearly

      [Dirk Hufnagel] 12:13:48
      allocations. It was mentioned already in the HPC focus area discussions:

      [Dirk Hufnagel] 12:13:55
      if you had multi-year allocations, that would (a)

      [Enrico Fermi Institute] 12:13:55
      Okay.

      [Dirk Hufnagel] 12:14:00
      reduce the effort to acquire the HPC resources, because you wouldn't have to constantly re-justify every year, and (b)

      [Dirk Hufnagel] 12:14:09
      it would also open up possibilities to include sizable HPC

      [Dirk Hufnagel] 12:14:15
      allocations in the planning process, which you can't do right now, because at the moment you write the proposal, you get the decision, and then usually on the order of a couple of months later you actually have the resources. You don't actually get the decision until a few

      [Dirk Hufnagel] 12:14:37
      months before, which is too late to actually include it in the long-term planning process for resource use in the experiments. And that's, from that side, independent of any kind of pledging problems we have;

      [Dirk Hufnagel] 12:14:55
      that's a problem for being able to pledge at all.

      [Dirk Hufnagel] 12:14:58
      I mean, if we don't know that we have the resources, we cannot pledge them, even if there were procedures in place to be able to do so technically. And then, what was mentioned also many times before: large storage allocations with connectivity to the wide area network

      [Dirk Hufnagel] 12:15:14
      would simplify HPC operations; it would basically make some things possible

      [Dirk Hufnagel] 12:15:23
      that might not be possible now, and it definitely would reduce the cost.

      [Enrico Fermi Institute] 12:15:29
      Yeah, okay.

      [Dirk Hufnagel] 12:15:30
      And on the cloud side, Fernando, do you want to say something on the subscription model?

      [Fernando Harald Barreiro Megino] 12:15:37
      Well, I mean, I'm not sure exactly how it works.

      [Fernando Harald Barreiro Megino] 12:15:44
      In the end, for the cloud vendor, as long as they get the check, the subscription is renewed, and as long as there is a common understanding of what the cost of the subscription is going to be,

      [Fernando Harald Barreiro Megino] 12:16:02
      it's okay. But then I don't know how the budgeting works in ATLAS, whether they prepare a yearly budget for that.

      [Dirk Hufnagel] 12:16:18
      But I'm really curious about what will happen after the, what is it,

      [Dirk Hufnagel] 12:16:24
      15 months.

      [Fernando Harald Barreiro Megino] 12:16:25
      Yes, it's around October 2023.

      [Fernando Harald Barreiro Megino] 12:16:29
      That's fine.

      [Dirk Hufnagel] 12:16:30
      We'll see. I mean, I really would like to see what happens,

      [Dirk Hufnagel] 12:16:32
      and whether they just renew it at the same level, or whether they actually drill into the billing data that they collect and do some analysis. I mean, it depends, I guess, on the billing data.

      [Enrico Fermi Institute] 12:16:40
      Okay.

      [Dirk Hufnagel] 12:16:46
      But still, I'm curious.

      [Dirk Hufnagel] 12:16:52
      And then some specific topics here on the HPC

      [Dirk Hufnagel] 12:16:56
      side. We mentioned facilitating CVMFS access;

      [Dirk Hufnagel] 12:17:01
      I think this is mostly a solved problem, because, as Brian said, cvmfsexec is considered kind of stable and is the solution for providing

      [Enrico Fermi Institute] 12:17:03
      Okay.

      [Dirk Hufnagel] 12:17:12
      CVMFS these days. That basically immediately gets you to the second problem: you need to either have some squid infrastructure in place, or the ability to launch our own, because that supports the cvmfsexec setup, and then there's Frontier on top of it. But

      [Dirk Hufnagel] 12:17:33
      first of all, access at facilities, access to software. Oh, there is a comment.

      [John Steven De Stefano Jr] 12:17:42
      Yeah, I was just wondering in general, on the HPC side, when it comes to CVMFS access,

      [John Steven De Stefano Jr] 11:46:15
      what the main issue is with the native client. Is it just connectivity that's restricted on the...

      [Dirk Hufnagel] 12:17:54
      It's usually that they don't want to install custom software for just one customer

      [John Steven De Stefano Jr] 12:17:59
      But then they'll use cvmfsexec?

      [Dirk Hufnagel] 12:18:02
      Well, that runs completely in user space with the latest versions. It's becoming increasingly easy. I mean,

      [John Steven De Stefano Jr] 12:18:05
      True

      [Dirk Hufnagel] 12:18:10
      we started using it like 2 years ago, and it's becoming increasingly easier to use because the newer machines run newer operating systems with newer kernel features.

      [Enrico Fermi Institute] 12:18:16
      Yeah.

      [Dirk Hufnagel] 12:18:21
      And basically, at this level, you can run it completely in user space.

      [Dirk Hufnagel] 12:18:27
      The system dependencies are so small these days that, if the kernel is new enough, it kind of just works.

      [Enrico Fermi Institute] 12:18:36
      In the past at least, you know, justified or not,

      [Enrico Fermi Institute] 12:18:39
      there was certainly some paranoia I have seen about, you know, running FUSE file systems on a compute node;

      [Enrico Fermi Institute] 12:18:48
      some sites were worried about that.

      [Dirk Hufnagel] 12:18:53
      Yeah, at some HPC sites you log into a batch node and, like, fusermount is not available.

      [Dirk Hufnagel] 12:19:00
      But that's not a problem if the kernel is new enough:

      [Dirk Hufnagel] 12:19:02
      cvmfsexec doesn't need the fusermount binary to do a FUSE mount.

      [Dirk Hufnagel] 12:19:07
      You can do it directly through unprivileged user namespaces.
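
      (Editor's sketch: running a payload under cvmfsexec entirely in user space, as discussed above. The repository, proxy host, and the "osg" distribution flavor are illustrative choices.)

          # Sketch: user-space CVMFS via cvmfsexec (no fusermount needed on
          # sufficiently new kernels with unprivileged user namespaces).
          import subprocess

          subprocess.run(["git", "clone",
                          "https://github.com/cvmfs/cvmfsexec.git"], check=True)
          subprocess.run(["./makedist", "osg"], cwd="cvmfsexec", check=True)

          # Point the client at a local squid -- the "second problem" above.
          with open("cvmfsexec/dist/etc/cvmfs/default.local", "a") as f:
              f.write('CVMFS_HTTP_PROXY="http://squid.example.org:3128"\n')

          # Run a payload with /cvmfs mounted inside an unprivileged session.
          subprocess.run(["./cvmfsexec", "cms.cern.ch", "--",
                          "ls", "/cvmfs/cms.cern.ch"],
                         cwd="cvmfsexec", check=True)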

      [John Steven De Stefano Jr] 12:19:11
      Sure, and I understand the concern about FUSE being another layer on top of an already complex system.

      [John Steven De Stefano Jr] 12:19:18
      But I think the native client has proven fairly stable lately, so I understand the concerns.

      [Enrico Fermi Institute] 12:19:24
      So it's just convincing the sites

      [Enrico Fermi Institute] 12:19:26
      that that's the case. Okay.

      [John Steven De Stefano Jr] 12:19:27
      Thanks.

      [John Steven De Stefano Jr] 12:19:27
      Yeah.

      [Dirk Hufnagel] 12:19:33
      And then another area: HPC would be so much simpler if they would provide Rucio-compatible storage. We're currently working with NERSC on that. At the LCFs, I don't think, I mean, there are no efforts

      [Dirk Hufnagel] 12:19:51
      there, and I'm not sure it will ever happen.

      [Dirk Hufnagel] 12:19:54
      But at least they support Globus Online, so we do have a Rucio-Globus Online integration.

      [Dirk Hufnagel] 12:20:01
      So it's doable.

      [Douglas Benjamin] 12:20:05
      So then we just have to make sure that we call out that there's a hop that's required

      [Dirk Hufnagel] 12:20:12
      Yeah, I mean, we tried, I don't know if you ever tried,

      [Dirk Hufnagel] 12:20:15
      the multi-hop. We tried it through NERSC, and it just worked.

      [Dirk Hufnagel] 12:20:21
      We have NERSC currently integrated via the still-existing GridFTP

      [Dirk Hufnagel] 12:20:25
      integration, which will eventually go away, but it's still there for now. And then both NERSC and ALCF Theta were integrated into Globus Online. And then basically you can configure

      [Dirk Hufnagel] 12:20:41
      the Rucio system so that when you put in a rule to create some data at Theta, it automatically first does a GridFTP transfer to NERSC and then immediately a Globus Online transfer from NERSC to Theta.
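
      (Editor's sketch of that multi-hop configuration from the Rucio client side. RSE names are hypothetical, and the server-side transfer configuration that actually enables multi-hop routing is not shown.)

          # Sketch: mark an intermediate RSE as a permitted hop, then make a
          # normal rule against the final destination.
          from rucio.client import Client

          client = Client()

          # Allow the conveyor to route transfers through this endpoint.
          client.add_rse_attribute(rse="NERSC_DTN",
                                   key="available_for_multihop", value=True)

          # Rucio then plans e.g. GridFTP to NERSC, then Globus to Theta.
          client.add_replication_rule(
              dids=[{"scope": "cms", "name": "/store/mc/example_dataset"}],
              copies=1,
              rse_expression="ALCF_THETA",
          )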

      [Douglas Benjamin] 12:21:02
      But will that work when NERSC goes to the next generation of Globus?

      [Dirk Hufnagel] 12:21:07
      No, but that's why the work with NERSC on the new interface is important, because that hopefully will eventually replace the GridFTP integration, which has been deprecated for many years and will go away.

      [Douglas Benjamin] 12:21:28
      So CMS is planning to keep NERSC in the HPC data flow path for essentially NSF

      [Douglas Benjamin] 12:21:39
      and other DOE HPCs,

      [Douglas Benjamin] 12:21:44
      versus putting Fermilab in the path, so that Fermilab becomes the connector?

      [Dirk Hufnagel] 12:21:45
      Yes, I'm

      [Dirk Hufnagel] 12:21:53
      Where exactly we put the multi-hop is still to be decided, and NERSC is an obvious candidate.

      [Dirk Hufnagel] 12:21:58
      But we also, I think, have T2s with Globus Online licenses,

      [Dirk Hufnagel] 12:22:02
      so that would be an alternative option that we have.

      [Douglas Benjamin] 12:22:11
      Because ATLAS uses, you know, a different site as the hop.

      [Dirk Hufnagel] 12:22:14
      Yeah.

      [Dirk Hufnagel] 12:22:23
      I mean, at the moment, at the level of transfers we need to do to the LCFs,

      [Enrico Fermi Institute] 12:22:29
      See.

      [Dirk Hufnagel] 12:22:30
      it's not that important where the hop location is.

      [Dirk Hufnagel] 12:22:33
      If we scale up LCF usage, and we're really looking at a future

      [Dirk Hufnagel] 12:22:38
      that's heavy on, like, data reconstruction or so, then it becomes a more important question, because that's potentially a lot of traffic you have to multi-hop.

      [Enrico Fermi Institute] 12:22:54
      Okay.

      [Dirk Hufnagel] 12:23:08
      And then we had a point here on network traffic and peerings, to improve connectivity and reduce or eliminate the egress cost.

      [Dirk Hufnagel] 12:23:18
      I think we had an interesting presentation from ESnet yesterday about the connectivity side of things.

      [Dirk Hufnagel] 12:23:22
      I don't think we got anywhere with reducing or eliminating the egress cost;

      [Dirk Hufnagel] 12:23:26
      that's not so much a question of how the networks are connected and how the peering is set up, but more a question of what type of cost model you have: you have a subscription, or you use a cloud that doesn't

      [Dirk Hufnagel] 12:23:42
      have egress charges. That seems to be the outcome I get out of this workshop.

      [Dirk Hufnagel] 12:23:50
      And then there's an open-ended question: what else? Is there anything

      [Dirk Hufnagel] 12:23:54
      we forgot to cover here that could help with our HPC and cloud usage?

      [Enrico Fermi Institute] 12:24:12
      and I think that's that's what we have for this session.


      [Enrico Fermi Institute] 12:24:23
      Yeah. So, you know, if there are other things that we should talk about, for, you know, facility features and policies, or any of the topics that we covered in previous days, I think it would be a good time to bring them up now;

      [Enrico Fermi Institute] 12:24:44
      otherwise we can wrap up the session a little early.

      [Paolo Calafiura (he)] 12:24:54
      when do we reconnect? If we

      [Enrico Fermi Institute] 12:24:57
      The next session will be at one o'clock Central time.

      [Paolo Calafiura (he)] 12:25:03
      Okay.

      [Enrico Fermi Institute] 12:25:17
      Seeing people disconnect, so maybe we'll just go ahead and close out, and then resume in an hour and a half.

      [Dirk Hufnagel] 12:25:23
      sounds good


      [Paolo Calafiura (he)] 12:25:24
      Bye, folks.

      [Fernando Harald Barreiro Megino] 12:25:25
      okay.

       

      • 10:00
        Accounting / Pledging 30m

        [Eastern Time]

         

        ACCOUNTING SLIDES

        [Enrico Fermi Institute] 11:00:03
        Cool, Dale is indeed here.

        [Enrico Fermi Institute] 11:00:25
        We'll wait a couple more minutes for folks to filter in

        [Enrico Fermi Institute] 11:01:41
        oh!


        [Enrico Fermi Institute] 11:02:37
        Yes, slides

        [Enrico Fermi Institute] 11:03:24
        Okay, we're gonna get started here in just a minute.

        [Enrico Fermi Institute] 11:03:28
        So thanks, everyone, for making it to the final day of the workshop. The goal

        [Enrico Fermi Institute] 11:03:35
        for this morning will be to have a discussion about a number of topics: one of which is accounting, the other would be to continue our pledging discussion,

        [Enrico Fermi Institute] 11:03:50
        assuming, you know, all the appropriate people are here for that.

        [Enrico Fermi Institute] 11:03:53
        Another topic we wanted to cover was security, you know, both on the clouds and the HPCs.

        [Enrico Fermi Institute] 11:04:02
        And I think ultimately the stuff that we cover this morning will inform policy recommendations that we would make as part of the report in the afternoon.

        [Enrico Fermi Institute] 11:04:17
        I don't expect us to take the full 2 hours; we may not even take the full 2 hours this morning.

        [Enrico Fermi Institute] 11:04:26
        But in the afternoon we'll have a presentation from the Vera Rubin folks, and then I think we have a couple of other minor topics, any other business, and sort of next steps for the work to go into the report.

        [Enrico Fermi Institute] 11:04:47
        So, about 20 people online; that's probably good to get going here.

        [Enrico Fermi Institute] 11:04:56
        So yeah.

        [Enrico Fermi Institute] 11:05:01
        Dirk, did you want to make comments about this one?

        [Dirk Hufnagel] 11:05:03
        Yeah, I can get this started. So, as with previous slides, the green is a question that's directly copied from the charge, and the question was: what can US ATLAS and US CMS recommend to the collaborations to improve the utilization of commercial cloud and HPC

        [Dirk Hufnagel] 11:05:23
        resources, both in terms of enabling more workflows that we can run there, and also improving the cost effectiveness of using these resources? And one thing that already came up in the previous days, and we'll have a dedicated slot

        [Dirk Hufnagel] 11:05:44
        for it here, so we'll see how much additional discussion we need, is that at the moment these resources are nice-to-haves, opportunistic. They give us some free cycles where we can run some things, but they're not included

        [Dirk Hufnagel] 11:05:58
        in the pledge, so we don't get full credit for them, and we don't have them included in the planning.

        [Dirk Hufnagel] 11:06:06
        Another thing is that to make use of sites like the LCFs

        [Enrico Fermi Institute] 11:06:08
        Let's

        [Dirk Hufnagel] 11:06:12
        specifically, but also other opportunities that are available in cloud

        [Enrico Fermi Institute] 11:06:15
        Okay.

        [Dirk Hufnagel] 11:06:16
        and then on our own grid sites potentially, we need GPU payloads.

        [Dirk Hufnagel] 11:06:20
        We need some way to utilize deployed GPU resources. On the cloud side, we had a lot of discussion yesterday on that in the cost section, and on the networking: a big part of the worry on cost in cloud is egress.

        [Enrico Fermi Institute] 11:06:24
        See.

        [Enrico Fermi Institute] 11:06:27
        Okay.

        [Dirk Hufnagel] 11:06:41
        So minimizing egress, or negotiating it away in some way, be it peering agreements

        [Dirk Hufnagel] 11:06:49
        or a subscription where this cost is basically removed, or targeting cloud resources that don't have egress charges:

        [Dirk Hufnagel] 11:06:57
        we basically want to avoid the egress costs, which currently are not dominating,

        [Dirk Hufnagel] 11:07:03
        but are a large factor of the whole cloud cost calculation. And on the HPC side, the focus will be on reducing operational overheads, especially on the LCF

        [Dirk Hufnagel] 11:07:19
        side, where this is still a little bit of an R&D area in terms of figuring out what the final operational model will look like. And then another way to reduce cost and make it easier to operate HPC resources is to get sizable storage allocations, because it makes

        [Dirk Hufnagel] 11:07:39
        some things simpler.
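
        (Editor's sketch: back-of-envelope arithmetic for why egress looms so large in the cloud cost calculation. The per-GB rate is an illustrative list-price assumption, not a quote; a subscription or peering agreement changes it entirely.)

            # Sketch: monthly egress cost at an assumed list price.
            EGRESS_USD_PER_GB = 0.09   # illustrative assumption, not a quote
            monthly_output_tb = 500    # hypothetical output shipped home

            cost = monthly_output_tb * 1000 * EGRESS_USD_PER_GB
            print(f"~${cost:,.0f}/month at list price")  # ~$45,000/month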

        [Dirk Hufnagel] 11:07:47
        We could move on, maybe, at this point to talking about accounting, which is part of having these types of resources be a fully equivalent player. I mean, pledging is one side, but then you need to take care of accounting to make sure that you actually deliver what

        [Dirk Hufnagel] 11:08:06
        you promise to deliver. We have a comment. I guess we can

        [Dirk Hufnagel] 11:08:10
        do that right away.

        [Dirk Hufnagel] 11:08:10
        We can do that right right away

        [Enrico Fermi Institute] 11:08:14
        Yeah, feel free to jump in at any time.

        [Dirk Hufnagel] 11:08:16
        Eric

        [Eric Lancon] 11:08:17
        Yes, good morning. Before we change the slide: I think the first recommendation would be that ATLAS and CMS both work on

        [Eric Lancon] 11:08:35
        having GPU-friendly, or whatever-accelerator-friendly, payloads. Because if you don't have generic software, you will not be able to use these resources.

        [Eric Lancon] 11:08:56
        And if these resources are very restricted to a specific kind of application, let's say just a specific

        [Eric Lancon] 11:09:07
        simulation, because it's suited to, say, GPUs, there would be no way to get them pledged, because pledges, as they are for now,

        [Enrico Fermi Institute] 11:09:13
        Okay.

        [Eric Lancon] 11:09:19
        are for all workflows together; they're not graded by type of workflow.

        [Dirk Hufnagel] 11:09:28
        Yes, that's one problem, and we can discuss this later. At the moment, with the pledges, I mean,

        [Enrico Fermi Institute] 11:09:33
        Sure.

        [Dirk Hufnagel] 11:09:34
        you can pledge across CPU architectures, because

        [Dirk Hufnagel] 11:09:39
        you can run your normalization, whatever it is: right now HS06, in the future it will be HEPscore.

        [Dirk Hufnagel] 11:09:45
        You can run your benchmark, and you get your CPU speed.

        [Dirk Hufnagel] 11:09:49
        It includes some averaging over the architectures, but at least it's in principle possible.

        [Dirk Hufnagel] 11:09:54
        GPU is more tricky, because how do you account for that?

        [Dirk Hufnagel] 11:09:56
        Do you give, like, a 20% bonus? Do you look at the most commonly used workflow?

        [Dirk Hufnagel] 11:10:02
        And I think that's something we can't decide here.

        [Dirk Hufnagel] 11:10:06
        That's something we'll have to discuss with WLCG: once we have these workloads that can use GPUs and get a benefit from them,

        [Dirk Hufnagel] 11:10:16
        how do you factor that into the performance normalization?

        [Dirk Hufnagel] 11:10:21
        Is there going to be some extra factor? Is it a different category that you pledge?

        [Dirk Hufnagel] 11:10:25
        I don't know. I don't have the answers, but it's an area that needs to be discussed

        [Dirk Hufnagel] 11:10:30
        with WLCG.
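
        (Editor's sketch: the CPU-side normalization Dirk describes, where a pledge is quoted in benchmark units so heterogeneous sites are comparable. Site names and per-core scores are hypothetical benchmark results.)

            # Sketch: pledges in normalized units (HS06 today, HEPscore later).
            sites = {
                # site: (cores, benchmark score per core)
                "tier2_amd": (10_000, 11.5),
                "hpc_intel": (50_000, 10.0),
            }

            pledged = {name: cores * per_core
                       for name, (cores, per_core) in sites.items()}
            print(pledged, "total:", sum(pledged.values()), "HS06")

            # The open question above: there is no agreed factor yet that
            # folds a GPU's contribution into this same single number.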

        [Eric Lancon] 11:10:33
        Yes, but the recommendation to the experiments would be first to actually prioritize

        [Eric Lancon] 11:10:41
        the development of GPU-friendly software.

        [Dirk Hufnagel] 11:10:45
        Yes, I mean, we're doing that. There are 2 avenues, in CMS at least, and that's the machine

        [Dirk Hufnagel] 11:10:53
        learning that we talked about yesterday, which might be the most common ground at the moment between CMS and ATLAS because of similar training frameworks and so on,

        [Dirk Hufnagel] 11:11:04
        and then use in the framework itself. For obvious reasons

        [Dirk Hufnagel] 11:11:08
        these are somewhat distinct from each other, but we're both looking at that.

        [Dirk Hufnagel] 11:11:12
        CMS maybe is a little bit more advanced than ATLAS because of the push from the HLT and the deployment of GPU resources there,

        [Dirk Hufnagel] 11:11:21
        but I'm sure that ATLAS is also at least looking at this.

        [Enrico Fermi Institute] 11:11:31
        Okay.

        [Dirk Hufnagel] 11:11:36
        Lincoln. Can you go to the next slide?

        [Dirk Hufnagel] 11:11:39
        If that comment was addressed: so, looking at accounting, maybe this is a good time.

        [Dirk Hufnagel] 11:11:46
        We have an invited contribution on benchmarking, because for accounting and for pledging one of the prerequisites is that you have to know what you're actually pledging, not just in terms of cores, but in terms of some normalized numbers, HS06 or in the

        [Dirk Hufnagel] 11:12:05
        future HEPscore. Maria offered a contributed talk about benchmarking of HPC, I think,

        [Enrico Fermi Institute] 11:12:06
        Good.

        [Dirk Hufnagel] 11:12:16
        and David wanted to give it. David, are you connected?

         

        HPC BENCHMARKS PRESENTATION

        [David Southwick] 11:12:20
        yeah, I am. Can you hear me?

        [Dirk Hufnagel] 11:12:21
        Okay. Do you want to share the slides? Otherwise Lincoln can also share.

        [Enrico Fermi Institute] 11:12:22
        Yeah, well, I'm not sharing anything.

        [David Southwick] 11:12:25
        Okay. Just like my headphones have

        [David Southwick] 11:12:35
        Just a second here

        [David Southwick] 11:12:38
        Okay, you can still hear me. Great. So I do have just a few short slides.

        [Enrico Fermi Institute] 11:12:41
        We can.

        [Enrico Fermi Institute] 11:12:45
        Okay, okay.

        [David Southwick] 11:12:45
        That I'd be happy to hear

        [David Southwick] 11:12:49
        See if I can do this

        [Enrico Fermi Institute] 11:12:51
        Sure.

        [David Southwick] 11:12:59
        Yes, hmm.

        [David Southwick] 11:13:06
        Okay.

        [David Southwick] 11:13:13
        Okay, do you see the slides? You do? Okay, great. So, yeah, I've got a few comments, just to share a bit about what we're doing

        [Enrico Fermi Institute] 11:13:15
        We do. Yep. Looks good

        [David Southwick] 11:13:27
        with HPC, with HEP benchmarking. The last couple of years I've been collaborating with the HEPiX benchmarking working group, really to take this replacement model for HEPSpec06

        [David Southwick] 11:13:49
        and develop it so that it can work on HPC

        [David Southwick] 11:13:52
        as well. I'm sure many people are very familiar with this and how it was done in the past:

        [David Southwick] 11:13:57
        it was meant to be as similar as possible to the bare metal worker

        [David Southwick] 11:14:04
        nodes that WLCG was using. So it was VMs, or at some point nested containers,

        [David Southwick] 11:14:11
        and things like this that are in no way compatible with HPC. So we put in a bunch of work to make this as lightweight and user-friendly as possible. So it's totally rootless;

        [Enrico Fermi Institute] 11:14:23
        Okay.

        [David Southwick] 11:14:27
        now we've switched to Singularity images, and there are also a bunch of quality-of-life things that allow you to use it on sites that, you know, don't have wide area networking, things like that. So it's really been a big effort for

        [David Southwick] 11:14:46
        the last year or so. A bit about the suite itself:

        [David Southwick] 11:14:51
        I'm sure some of you know it already, or

        [David Southwick] 11:14:52
        maybe have run it already, since we've been distributing, say, proof-of-concept or release candidates here for the last couple of months.

        [David Southwick] 11:15:04
        Basically it's, well, as I already shared, now a sort of flexible thing that you can run on any hardware.

        [David Southwick] 11:15:14
        You can see I've got a small graphic on the right here: the suite itself is an orchestrator that will go and collect metadata on whatever hardware it's running on and control the array of benchmarks you want to use. So on the bottom

        [David Southwick] 11:15:30
        part of the graphic there are a couple of different benchmarks: HEPSpec06 is one of them, HEPscore, which is the candidate replacement for it, or I don't know if I can call it a candidate anymore,

        [David Southwick] 11:15:46
        but you can easily plug in other benchmarks as well.
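
        (Editor's sketch: driving the suite on a worker node. The bmkrun entry point and YAML layout follow the HEP Benchmark Suite documentation as we understand it; the exact config keys here are an assumption.)

            # Sketch: write a minimal config and launch the orchestrator.
            import subprocess, textwrap

            config = textwrap.dedent("""\
                global:
                  mode: singularity        # rootless, as described above
                  benchmarks: [hepscore]   # other benchmarks plug in here
            """)
            with open("bmk_config.yml", "w") as f:
                f.write(config)

            subprocess.run(["bmkrun", "-c", "bmk_config.yml"], check=True)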

        [David Southwick] 11:15:49
        So this is the tool we've been using on HPC. A bit about that: this effort, like I said, started

        [David Southwick] 11:15:59
        I guess more than a year ago. The initial presentation of the HPC

        [David Southwick] 11:16:07
        work was during CHEP'21, and at that time we had just done large-scale deployments, doing several 100,000-core campaigns, and at that time we were looking at comparing new AMD

        [David Southwick] 11:16:22
        CPUs that were available widely on HPC

        [David Southwick] 11:16:26
        sites, but not yet widely accessible elsewhere, and so we did a comparison of that, and stability studies and whatnot.

        [David Southwick] 11:16:37
        So that was the interesting first step. What's happened since then is we've had a lot of software become available from the experiments:

        [David Southwick] 11:16:48
        obviously the first look at the Run 3 workloads, but along with that there's been a bunch of heterogeneous development from basically all of the experiments.

        [David Southwick] 11:16:57
        And I mean heterogeneous both in compiled codes and in accelerators.

        [David Southwick] 11:17:06
        So we've got several workloads that are in development for Power and, of course, GPUs. So we've been using HPC then to take these workloads, or let's say snapshots

        [David Southwick] 11:17:23
        of them when they're sufficiently stable, containerize them in Singularity, and then run them at scale on HPC.

        [David Southwick] 11:17:32
        And that enables a lot of different interesting studies: GPU versus CPU, and then combined for some workloads that support that, as well as more exotic combinations,

        [David Southwick] 11:17:46
        so ARM plus GPU, Power plus GPU, and things like this.

        [David Southwick] 11:17:54
        And I know this was discussed at some point yesterday;

        [David Southwick] 11:17:56
        I think I was listening in. But in case: these are now available via the benchmarking suite, and you can run it just as you would on a bare metal machine on HPC.

        [David Southwick] 11:18:12
        I think there was a presentation yesterday from Eric about the ML/AI workload.

        [David Southwick] 11:18:20
        That's also containerized and can be run at scale on HPC.

        [David Southwick] 11:18:25
        However, the configuration they have at the moment, the snapshot of that, is just single-node

        [David Southwick] 11:18:31
        at the moment. So if you do want to use that for MPI-related scaling, then you need to run just the workload container and not the suite, because, well, it's not the target at the moment of WLCG to do MPI. And as I mentioned, I guess we've

        [David Southwick] 11:18:54
        got a lot of other quality-of-life things. So if you have local storage, like CVMFS, you could take advantage of this instead of remote pulls and copies and whatnot.

        [David Southwick] 11:19:03
        But really, as has been said here many times, there's a lot of interest in GPUs, and you can see I've got a small slide here on the number

        [David Southwick] 11:19:15
        Of course you get from current and next generation Gpus.

        [David Southwick] 11:19:21
        So there's I love to going in that direction, and we know we're expecting close.

        [David Southwick] 11:19:30
        so, that being said, there isn't industry, standard Gpu benchmark yet, and at least from the benchmarking side of things.

        [David Southwick] 11:19:43
        We are kind of approaching it in the same way that we have Cpus.

        [David Southwick] 11:19:48
        Oh, you use workload set production workloads, or what will be production workloads, And we can generate the score in the same way that we do for head score which is some function of throughput or events.

        [David Southwick] 11:19:59
per second. So this is what we've been using so far to try to understand the capabilities of a machine that is going to be running GPU-

        [David Southwick] 11:20:13
only, CPU-only, or both. And there was a HEPscore workshop last week, which I think several people here were part of, and a lot of discussion happened there as well on how to account for these sorts of resources.
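To make the "score as some function of throughput" idea concrete: a HEPscore-style summary can be built as a geometric mean of per-workload event rates normalized to a reference machine. The workload names and numbers below are invented, and the real suite's weighting was still under discussion at the time of this talk:

    # Toy HEPscore-style aggregation: geometric mean of normalized throughputs.
    import math

    measured = {"gen-sim": 4.2, "digi-reco": 1.8, "hlt": 9.5}   # events/s, made up
    reference = {"gen-sim": 3.0, "digi-reco": 1.5, "hlt": 8.0}  # reference machine

    ratios = [measured[w] / reference[w] for w in measured]
    score = math.prod(ratios) ** (1.0 / len(ratios))  # unweighted geometric mean
    print(f"HEPscore-like value: {score:.3f}")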

        [David Southwick] 11:20:34
So I'll just conclude by saying that we've been active on HPC

        [David Southwick] 11:20:38
benchmarking now for a couple of years. We use the suite because it automates the running and reporting at large scale.

        [Enrico Fermi Institute] 11:20:39
        okay.

        [David Southwick] 11:20:48
You can do whole-partition, partial-partition, or multi-partition benchmarks.

        [David Southwick] 11:20:52
This includes exotic workloads, both for machine learning

        [David Southwick] 11:20:56
and AI as well as for architectures, and we're also starting to look at,

        [David Southwick] 11:21:03
let's see, sort of the other services that you get on HPC. I know it was mentioned yesterday that there are issues with scaling

        [David Southwick] 11:21:14
I/O-bound workloads, and how can you tell what's good on a shared file system that maybe you don't have any information about.

        [David Southwick] 11:21:24
So we are starting to develop, and we've got a prototype of,

        [David Southwick] 11:21:27
let's say, a benchmark. I say benchmark kind of in quotes, because it's not benchmarking a compute unit but testing the shared file system service,

        [David Southwick] 11:21:37
and then from there giving you some feedback on both your workload and, let's say, how many nodes you could scale that up to before it starts locking up the file system in some way.
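The shared-filesystem probe described here can be pictured as timing bulk writes and reads against the shared filer and reporting throughput. The sketch below is a minimal stand-in with invented paths and sizes, not the actual prototype:

    # Naive shared-filesystem probe: time a bulk write and read, report MB/s.
    import os, time

    TARGET = "/shared/fs/io_probe.bin"  # hypothetical path on the shared filer
    SIZE_MB = 256
    block = os.urandom(1024 * 1024)     # 1 MiB of incompressible data

    t0 = time.monotonic()
    with open(TARGET, "wb") as f:
        for _ in range(SIZE_MB):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())            # ensure the data actually hit the filer
    write_mbps = SIZE_MB / (time.monotonic() - t0)

    t0 = time.monotonic()
    with open(TARGET, "rb") as f:
        while f.read(1024 * 1024):
            pass
    read_mbps = SIZE_MB / (time.monotonic() - t0)

    print(f"write {write_mbps:.1f} MB/s, read {read_mbps:.1f} MB/s")
    os.remove(TARGET)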

        [David Southwick] 11:21:53
So that's what's new with us, and a little bit of a peek into what

        [David Southwick] 11:21:57
we're doing and where we're going. And I'm happy to answer any questions.

        [Enrico Fermi Institute] 11:22:06
        Okay, David. Thank you very much. So we have a couple of hands raised.

        [Enrico Fermi Institute] 11:22:10
Follow-up? Go ahead.

        [Paolo Calafiura (he)] 11:22:12
Good morning, everyone. So, first: you said that there are no industry-standard ML

        [Paolo Calafiura (he)] 11:22:23
benchmarks, and I think that is still accurate.

        [Paolo Calafiura (he)] 11:22:26
But I want to be sure you guys are aware of ML-

        [Paolo Calafiura (he)] 11:22:30
Perf, which is becoming a de facto standard.

        [Paolo Calafiura (he)] 11:22:36
Okay, so that was the first comment.

        [David Southwick] 11:22:36
        yeah.

        [Paolo Calafiura (he)] 11:22:41
The second comment is that, I mean, you guys have a very difficult job, because what we are seeing in CCE

        [Paolo Calafiura (he)] 11:22:52
is that the same software, I mean the same algorithm run with different software on different platforms, performs quite differently.

        [Paolo Calafiura (he)] 11:23:04
So if you have a fast parametrized simulation and you run that with Alpaka, or with CUDA, or with Kokkos, you'd get different performance of the same code, in

        [Enrico Fermi Institute] 11:23:13
        Hmm.

        [Paolo Calafiura (he)] 11:23:22
principle, and on different machines, depending on what portability layers you use.

        [Paolo Calafiura (he)] 11:23:31
So I wanted to ask you if you have settled on a platform, like which parallelization platforms to use, or if you are taking a mix. What's your plan?

        [David Southwick] 11:23:45
So I don't think it's settled at the moment, and of course this is a popular question: what sort of optimization targets are you using for your workloads?

        [David Southwick] 11:23:56
        And I mean not just for

        [David Southwick] 11:24:01
translating across architectures, but even within the same families of units.

        [David Southwick] 11:24:09
So at the moment I think the method we have is to take the minimum compatibility, so that we don't end up having, you know, 10-20 different versions of the workloads. We want to have the same thing that we can run everywhere and sort of take all of these variables out of the

        [Enrico Fermi Institute] 11:24:18
        Okay.

        [David Southwick] 11:24:30
equation. But that being said, I don't think there's a clear answer on the proper way to do that yet, and it's not settled,

        [David Southwick] 11:24:42
let's say.

        [Enrico Fermi Institute] 11:24:48
Okay. Steve has his hand raised.

        [Steven Timm] 11:24:51
Yeah, I was just wondering if the toolkit that you have been using is available anywhere for download.

        [David Southwick] 11:24:58
Oh, good. Yeah, absolutely. So, let's say, I'll just go back;

        [David Southwick] 11:25:04
I think I have a link too. So the benchmarking suite itself, all of that is open source.

        [David Southwick] 11:25:13
It's on GitLab at CERN, so gitlab.cern.ch/hep-benchmarks, and then the suite is the project down there.

        [Steven Timm] 11:25:15
        oh, okay, I see it. Okay.

        [David Southwick] 11:25:22
I have the link on the screen. Yeah, the other benchmark I was talking about, for, let's say, services,

        [David Southwick] 11:25:30
this I/O benchmark, I don't have the link on there, but I can share it afterwards.

        [David Southwick] 11:25:36
It's in an early prototype state right now.

        [David Southwick] 11:25:40
We have been working on it this year. It doesn't cover all the things you can throw at it yet.

        [David Southwick] 11:25:47
        But yeah, you can download it and play around with it.

        [David Southwick] 11:25:51
And I should mention that the idea for this really is from... I don't know if he's in the room now, but I've seen him in the previous days, so shout out to

        [David Southwick] 11:26:04
him, I guess.

        [Enrico Fermi Institute] 11:26:07
        Okay, Okay, go ahead.

        [Steven Timm] 11:26:10
Hello! So the HEP benchmark it runs,

        [Steven Timm] 11:26:14
that's no different than the one they were running on a regular node,

        [Steven Timm] 11:26:17
then?

        [David Southwick] 11:26:18
        Yeah, Yep: exactly.

        [Steven Timm] 11:26:20
Okay, good. And then there may be a new special one?

        [David Southwick] 11:26:25
Well, yeah, like I said, there was a workshop last week on this, discussing,

        [David Southwick] 11:26:33
you know, how to choose the final versions and the weighting and whatnot.

        [David Southwick] 11:26:37
So there are people around who are a little bit more qualified, I think, to answer specific questions on that.

        [David Southwick] 11:26:43
        But it's in progress. Yeah.

        [Enrico Fermi Institute] 11:26:47
        Okay.

        [David Southwick] 11:26:48
There will be another version, I guess, with the, let's say, gold standard for the benchmark suite, once

        [David Southwick] 11:26:55
that's decided.

        [Enrico Fermi Institute] 11:27:02
        Sure.

        [Dirk Hufnagel] 11:27:04
Yeah, I just had a quick question. You said you're benchmarking CPU-plus-GPU and also GPU workloads.

        [Dirk Hufnagel] 11:27:13
Now, I mean, HEPscore and HEPspec06 on CPU, that's well established:

        [Dirk Hufnagel] 11:27:20
you take a mix of experiment-specific workloads, throw something together, and get some average.

        [David Southwick] 11:27:26
        Yep.

        [Dirk Hufnagel] 11:27:26
What do you do for the GPU stuff?

        [Dirk Hufnagel] 11:27:30
Because it's so early, and the experiments' algorithms —

        [Dirk Hufnagel] 11:27:34
I know CMS has something, but it's not complete.

        [Dirk Hufnagel] 11:27:36
        It's not a complete picture. Do you run synthetic stuff, or do you run the very early stuff?

        [Dirk Hufnagel] 11:27:42
        Because that's the only thing you can do

        [David Southwick] 11:27:43
Yeah. So we're running very early stuff from CMS. There's MLPF,

        [David Southwick] 11:27:50
which is a bit of an odd bird; yeah, there was a talk yesterday on that.

        [David Southwick] 11:27:56
We also are using HLT, and then as well the sort of rolling builds from, for example, MadGraph.

        [Dirk Hufnagel] 11:28:11
        Okay, thanks.

        [David Southwick] 11:28:13
I guess there are also some other exotic GPU workloads, but these are from the Beams Department.

        [David Southwick] 11:28:21
So there's this SimpleTrack, or —

        [David Southwick] 11:28:26
I know Patatrack is in there as well, but I don't think we have a container for Patatrack.

        [Dirk Hufnagel] 11:28:31
So it's early going, so the numbers you get might not necessarily be representative of whatever we end up running in production later.

        [David Southwick] 11:28:40
Exactly. So, I mean, there are a lot of results already, since, you know, we use the suite as a reporting tool as well.

        [David Southwick] 11:28:48
So it all gets pushed up over AMQ into Kibana, and then all the workloads are hashed, so you can compare performance across every node that it's run on with the same version of the workload. So at least with

        [David Southwick] 11:29:04
that published version, or if you have your own build, let's say, you can track and compare device to device like this. But you're right,

        [David Southwick] 11:29:13
I mean, these are really, let's say, snapshot releases of some of these.

        [David Southwick] 11:29:19
So they will change whenever it's decided that something's going to be a production,

        [David Southwick] 11:29:25
or let's say a final, validated version.

        [David Southwick] 11:29:28
        Yeah.
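The hashing scheme mentioned above can be pictured as deriving a stable identifier from the workload image plus its configuration, so results are only compared across nodes when they ran exactly the same thing. The field names below are invented for illustration, not the suite's real record format:

    # Derive a stable workload identifier and attach it to a result record.
    import hashlib, json

    workload = {
        "image": "cms-hlt:v1.2",                    # hypothetical container tag
        "config": {"events": 1000, "threads": 8},   # hypothetical run settings
    }
    canonical = json.dumps(workload, sort_keys=True).encode()
    workload_hash = hashlib.sha256(canonical).hexdigest()[:12]

    result = {"workload_hash": workload_hash, "node": "nid001234", "score": 42.7}
    print(json.dumps(result))  # what would be shipped off to the reporting backend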

        [Paolo Calafiura (he)] 11:29:29
Sorry to jump in. In CCE we have assembled what we think is a cross-section of representative applications, and of course we have no

        [Paolo Calafiura (he)] 11:29:42
formal standing, it's just from asking around. I wonder if we should compare now and see if we picked the same ones, and under which configurations. Maybe we should have an offline discussion between our groups.

        [David Southwick] 11:29:54
Sure. Are you using workloads or, like, off-the-shelf benchmarks?

        [Paolo Calafiura (he)] 11:30:02
No, no, we're using HEP workloads.

        [David Southwick] 11:30:06
        Yeah.

        [Paolo Calafiura (he)] 11:30:08
So simulation, tracking. We're not doing machine learning workloads yet.

        [Paolo Calafiura (he)] 11:30:11
So that's something that's not there yet.

        [Enrico Fermi Institute] 11:30:13
        Okay.

        [Paolo Calafiura (he)] 11:30:14
But we should compare notes, also because of this dimension:

        [Paolo Calafiura (he)] 11:30:18
it's not only the workload, it's the software platform you use, which makes one configuration different from another.

        [Paolo Calafiura (he)] 11:30:28
Anyway, I'll shut up.

        [David Southwick] 11:30:30
Yeah, no, we can connect offline.

        [Enrico Fermi Institute] 11:30:35
A quick question: how long do these benchmarks take to run?

        [David Southwick] 11:30:38
So some of the GPU ones can be fast, on the order of, I don't know, 20 to 60 minutes. Some of the CPU ones are much longer, depending on what the experiment code owners have put forward

        [David Southwick] 11:30:58
as well as what they view as a representative set. So I think the default block of CPU-only, in the current release candidate, is something like 4 to 6 hours.

        [David Southwick] 11:31:13
But that's many workloads run back to back, and you run 3 iterations of each to get an average and get rid of outliers. I'm not sure what that will look like for GPU, because all the workloads that I talked about for HPC that are

        [David Southwick] 11:31:32
not sort of Run 3 standard ones, these are optional things that you can elect to run with the suite. They're not included by default.

        [David Southwick] 11:31:45
They are available; you just have to use a little bit different configuration, which is included in the suite.
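As a concrete reading of the "3 iterations, average, drop outliers" procedure just described (the suite's actual aggregation rule may differ), one might do:

    # Robust per-workload score from 3 iterations: drop runs far from the median.
    import statistics

    runs = [41.8, 42.3, 35.1]  # events/s from 3 iterations (made-up numbers)

    med = statistics.median(runs)
    # Keep runs within 20% of the median; an iteration hit by a node hiccup
    # (like the 35.1 here) is discarded before averaging. Threshold is arbitrary.
    kept = [r for r in runs if abs(r - med) / med <= 0.20]
    score = statistics.fmean(kept)
    print(f"kept {len(kept)}/{len(runs)} runs, score = {score:.2f}")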

        [Enrico Fermi Institute] 11:31:53
Yeah, I guess the thing I'm wondering, and maybe this is just a broader general question for everybody here: is this the kind of thing that we want to start incorporating into the

        [Enrico Fermi Institute] 11:32:02
integration process for HPCs, right? We go to, you know,

        [Enrico Fermi Institute] 11:32:07
stand up Perlmutter, and then the next machine, and we should make sure

        [Enrico Fermi Institute] 11:32:11
we run these benchmarks as part of that integration,

        [Enrico Fermi Institute] 11:32:13
so we start getting, you know, the benchmark numbers in place, and then maybe that helps eventually with the pledging and that sort of thing.

        [Dirk Hufnagel] 11:32:23
I can just say what we're doing right now, because even though they are opportunistic, we still at the end of the year compile the usage in HEPspec06 terms, just to have a comparison and to see what the big picture looks like. And we just went through

        [Dirk Hufnagel] 11:32:40
that exercise in '21 for all the HPCs

        [Dirk Hufnagel] 11:32:44
that we're using in the US. And basically what I'm doing right now:

        [Dirk Hufnagel] 11:32:47
I look at the CPU and compare to what

        [Dirk Hufnagel] 11:32:49
others have benchmarked, and usually you'll find a number where you can come up with a defensible

        [Enrico Fermi Institute] 11:32:56
        The

        [Dirk Hufnagel] 11:32:58
HEPspec06. But, I mean, especially if we get to really pledging the resources with that, it becomes relevant;

        [Dirk Hufnagel] 11:33:07
I think we need to run the benchmarks.

        [Dirk Hufnagel] 11:33:09
Maybe not right now, but once we get to that point I think we should, to get a better number.
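A worked example of the conversion Dirk describes: take the delivered core-hours and multiply by a defensible HS06-per-core figure taken from published benchmarks of a comparable CPU. All numbers below are invented:

    # Convert delivered core-hours into HEPspec06-hours for reporting.
    core_hours = 2_500_000      # usage delivered on the HPC over the year (made up)
    hs06_per_core = 15.0        # defensible per-core figure for that CPU model

    hs06_hours = core_hours * hs06_per_core
    print(f"{hs06_hours:,.0f} HS06-hours")  # what gets compared against a pledge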

        [Enrico Fermi Institute] 11:33:17
Yeah, I mean, it seems like we ought to have a plan for it, even if the content of the benchmarks themselves changes over time, just to have that kind of in our minds and in the pipeline for when we integrate these resources.

        [Dirk Hufnagel] 11:33:17
        Okay.

        [Dirk Hufnagel] 11:33:30
We have a couple of hands raised. Maybe we should get these comments.

        [Enrico Fermi Institute] 11:33:34
Yeah, okay. Andrew has his hand up.

        [Andrew Melo] 11:33:39
Yeah, I was just gonna point out, I mean, Dirk talked a little bit earlier about, you know, like, how do you,

        [Andrew Melo] 11:33:45
how do you, you know, account for the GPUs in the HEPscore?

        [Andrew Melo] 11:33:52
        And you know, he suggested, maybe you give like a 20% bonus, or something like that.

        [Andrew Melo] 11:33:55
I think that what makes sense, and I argued this at the HEPscore meeting last week, is that you can't really benchmark machines with just one single scalar anymore.

        [Enrico Fermi Institute] 11:34:00
        Cool.

        [Andrew Melo] 11:34:04
Right? So I think it's just gonna have to be some sort of tuple, you know, per machine, to have these different accelerators on it.

        [Andrew Melo] 11:34:11
And I was gonna also point out that while, you know, we're working on, I guess you can call it HEPscore22, with Run 3 workloads, the number that pops out of HEPscore right now isn't weighted

        [Andrew Melo] 11:34:26
for that yet; it's just in the future. So, at least initially, the HEPscore unit that will be pledged will only take the CPU into account.
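A minimal sketch of the "tuple instead of a single scalar" idea, carrying one score per resource type; the type and numbers are illustrative only, not an agreed scheme:

    # Per-machine score as a tuple of components instead of one scalar.
    from typing import NamedTuple, Optional

    class MachineScore(NamedTuple):
        cpu: float            # HEPscore-like CPU component
        gpu: Optional[float]  # None when the node has no accelerator

    plain_node = MachineScore(cpu=1250.0, gpu=None)
    gpu_node = MachineScore(cpu=1180.0, gpu=3400.0)

    # A pledge that initially only counts CPU, as described above, ignores .gpu:
    pledged = sum(m.cpu for m in (plain_node, gpu_node))
    print(pledged)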

        [David Southwick] 11:34:29
        Okay.

        [David Southwick] 11:34:37
Thanks, Andrew. And I'd like to add on to that: since we do have this automated reporting, and it gives you the JSON

        [David Southwick] 11:34:47
of all of the workloads. Yes, it will give you a single, say, HEPscore value, but it also gives you the value of

        [Enrico Fermi Institute] 11:34:49
        Yes.

        [David Southwick] 11:34:55
every workload. So if you're just interested in HLT, or whatever it is, you can get the number on that machine for that benchmark, and you can go compare just the benchmarks you're interested in. This is also already available.

        [David Southwick] 11:35:14
So it is, I guess, in that way,

        [David Southwick] 11:35:20
a bit more fine-grained than what we had in the past.
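In code, pulling one workload's number out of such a report might look like the following; the JSON layout here is a guess for illustration, not the suite's real schema:

    # Read a suite-style report: one headline score plus per-workload numbers.
    import json

    report = json.loads("""
    {
      "hepscore": 812.4,
      "workloads": {
        "hlt":     {"score": 95.1, "hash": "a1b2c3d4e5f6"},
        "gen-sim": {"score": 40.2, "hash": "0f9e8d7c6b5a"}
      }
    }
    """)

    # The single summary value, and the one workload you actually care about:
    print(report["hepscore"], report["workloads"]["hlt"]["score"])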

        [Enrico Fermi Institute] 11:35:25
Okay, a question or comment from Ian.

        [Ian Fisk] 11:35:29
Yeah, it was only that I wanted to second that I think it's valuable to be using the benchmarks as we begin to commission the multiple HPC sites. Also, I think, in addition to having the benchmark that tells you a number about how well they're

        [Ian Fisk] 11:35:43
performing, I'm wondering if it also serves a purpose as sort of what we used to think of as the site availability tests. For some things there's a diversity in the workflows, and if they all succeed and give reasonable numbers,

        [Ian Fisk] 11:35:54
you also have a reasonable expectation that the site is pretty well configured

        [Ian Fisk] 11:35:58
        Again.

        [Andrew Melo] 11:36:03
So, a fun anecdote about that, Ian: we actually did see some of this, where someone was benchmarking a machine, and

        [Andrew Melo] 11:36:15
they knew what the HEPscore should be for that machine, and what they were seeing was about half or, you know, 75% of that. It turns out that the cooling of that rack had failed and the machine was actually power throttling.

        [Andrew Melo] 11:36:24
So, yeah, people were able to say, hey, this machine isn't working right, just from looking at these numbers.
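The throttling anecdote suggests a simple health check: flag a node whose measured score falls well below the reference for the same hardware and workload hash. The function and threshold below are illustrative, not an existing tool:

    # Flag nodes that underperform their reference score by more than a tolerance.
    def flag_unhealthy(measured: float, reference: float, tolerance: float = 0.15) -> bool:
        """True when a node delivers less than (1 - tolerance) of its reference."""
        return measured < reference * (1.0 - tolerance)

    # The failed-cooling machine from the anecdote: roughly half the expected score.
    print(flag_unhealthy(measured=610.0, reference=1250.0))  # -> True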

        [Ian Fisk] 11:36:30
        alright.

        [David Southwick] 11:36:33
        Yep.

        [Ian Fisk] 11:36:33
Yeah, I think the other thing, as we commission these, probably more applicable to HPC

        [Ian Fisk] 11:36:40
than cloud, is that these machines are much more complicated;

        [Ian Fisk] 11:36:44
they're not as sort of simple as a pile of essentially x86 servers.

        [Ian Fisk] 11:36:48
They tend to have more complex services, whether cooling or interconnect or whatever, and so a more detailed set of benchmarks makes sense.

        [Enrico Fermi Institute] 11:36:59
        Okay.

        [Enrico Fermi Institute] 11:37:02
So on one of your slides you mentioned that, you know, you do the uploading of all the results, and you also mentioned that you have some kind of batch uploader for

        [Enrico Fermi Institute] 11:37:13
secure workers. Does that mean that this would work on LCFs, where the workers don't have any, you know,

        [Enrico Fermi Institute] 11:37:18
outbound connectivity? You would just batch it and upload it from the login nodes?

        [Enrico Fermi Institute] 11:37:21
Is that the idea?

        [David Southwick] 11:37:23
Exactly. You know, sites like this aren't that common, but they're not uncommon either.

        [David Southwick] 11:37:30
There are several that we've been working with here in Europe that have a similar configuration. And normally the default case for this, you know the base case, is to run it on a single node, for vendors

        [David Southwick] 11:37:43
or whatever it is, and when the runs are finished it'll compile the report and then send it over AMQ.

        [David Southwick] 11:37:50
But if you don't have connectivity on the machine that you benchmark, then you can collect

        [David Southwick] 11:37:59
these JSONs afterward and do a batch reporting, basically, yeah, from a gateway node.
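A sketch of that batch-reporting mode: gather the result JSONs produced on the offline compute nodes and push them from a login/gateway node. The broker host, credentials, and destination are placeholders, and this assumes the third-party stomp.py package for the AMQ (STOMP) transport:

    # Batch-upload collected benchmark reports from a node with connectivity.
    import glob
    import stomp  # third-party: pip install stomp.py

    conn = stomp.Connection(host_and_ports=[("broker.example.org", 61613)])
    conn.connect("user", "password", wait=True)

    for path in glob.glob("/scratch/bench_results/*.json"):
        with open(path) as f:
            conn.send(destination="/topic/benchmark.results", body=f.read())

    conn.disconnect()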

        [Enrico Fermi Institute] 11:38:06
        Okay, great.

        [Enrico Fermi Institute] 11:38:10
Other questions for David?

        [Enrico Fermi Institute] 11:38:18
        Okay, Thank you, David.

        [David Southwick] 11:38:19
        Yep thanks.

         

        SLIDES AGAIN - ACCOUNTING

         



        [Enrico Fermi Institute] 11:38:25
        The slides again

        [Dirk Hufnagel] 11:38:32
        yeah, so.

        [Dirk Hufnagel] 11:38:36
So when we look at that process: assume we get some benchmark

        [Dirk Hufnagel] 11:38:40
numbers, some defensible numbers, and we figure out how we're gonna deal with the CPU problem and work out how we pledge this with

        [Dirk Hufnagel] 11:38:48
WLCG. At that point accounting goes from nice-to-have to,

        [Dirk Hufnagel] 11:38:55
actually, we have to justify what we're using and show that we're actually fulfilling the pledge.

        [Dirk Hufnagel] 11:39:01
And at that point accounting becomes mandatory. Right now

        [Dirk Hufnagel] 11:39:07
it's optional; we do it because we want to know what we're using.

        [Dirk Hufnagel] 11:39:10
But when the numbers start to matter, when we actually have a pledge in place, then we need to show that we actually deliver that pledge. And the current situation is that CMS made a push last year, made an effort last year, to get all the accounting data

        [Dirk Hufnagel] 11:39:29
pushed to APEL. We still have some problems there with some sites where we're using, like, multi-node jobs, where the system isn't quite aware

        [Dirk Hufnagel] 11:39:38
that there are actually multiple nodes behind it; it thinks it's one.

        [Dirk Hufnagel] 11:39:41
But that's some technical difficulties we're working on, and in principle things are connected. ATLAS doesn't currently do

        [Dirk Hufnagel] 11:39:51
this; cloud usage is an open question. Right, Fernando? You're not currently pushing your cloud usage data to APEL, or to any grid accounting portal?

        [Fernando Harald Barreiro Megino] 11:40:03
No, I'm not. This KAPEL that is written on the slides,

        [Fernando Harald Barreiro Megino] 11:40:11
that's a solution that, for example, Ryan Taylor from the University of Victoria implemented, because he's using a similar model, using the cloud, his private cloud, and there he did this

        [Fernando Harald Barreiro Megino] 11:40:25
KAPEL because he needed to push the resources. So there is some solution,

        [Fernando Harald Barreiro Megino] 11:40:31
but I don't have experience with that, and I'm not using it at the moment.

        [Fernando Harald Barreiro Megino] 11:40:35
And it's also not applicable to, for example, HPCs.
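For a rough picture of what gets pushed in such accounting flows: a tool in the KAPEL mold aggregates job usage into per-site monthly summaries for APEL. The record below only follows the general shape of such summaries, with invented values; it is a simplification, not the exact APEL schema:

    # Simplified, illustrative usage summary of the kind an accounting
    # publisher might assemble; field names and values are invented.
    summary = {
        "Site": "SOME-CLOUD-SITE",    # hypothetical site name
        "VO": "cms",
        "Month": 9,
        "Year": 2022,
        "WallDuration": 123_456_789,  # seconds, summed over all jobs
        "CpuDuration": 98_765_432,    # seconds of CPU time
        "NumberOfJobs": 54_321,
    }
    print(summary)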

        [Dirk Hufnagel] 11:40:41
Okay. And on top of accounting there's also monitoring, like operational monitoring.

        [Dirk Hufnagel] 11:40:49
We are doing this already, but depending on what integration method you pick for an HPC,

        [Dirk Hufnagel] 11:40:56
this can be tricky. For instance, in the US

        [Enrico Fermi Institute] 11:40:56
        Okay.

        [Dirk Hufnagel] 11:41:01
we basically overlay a logical site on top of each HPC facility; that's basically how, internally in CMS, the monitoring infrastructure

        [Dirk Hufnagel] 11:41:16
and the assumptions that it's built on see things.

        [Dirk Hufnagel] 11:41:19
But, for instance, in the Italian case, with the site extension to the Marconi100 HPC, they chose a different model. It's a site extension, so basically everything is under the

        [Dirk Hufnagel] 11:41:36
Tier-1 site umbrella, and then accounting can be a bit tricky, because you cannot use the site name as a dividing line between which resources are Tier-1 and which resources

        [Dirk Hufnagel] 11:41:48
are on the HPC. And then you kind of have to look at sub-site identifiers that basically divide this further into sub-sites, and not all monitoring systems are geared to support that. We've done some work on that, but

        [Dirk Hufnagel] 11:42:10
this is the problem. And in cloud, so far at least, ATLAS is doing their own separate site in PanDA,

        [Dirk Hufnagel] 11:42:21
so this problem doesn't come up. Also, the CMS scale tests basically overlay a separate site on the cloud resources.

        [Dirk Hufnagel] 11:42:29
But you could also imagine, if you do a seamless extension, like if a Tier-2 decides that they want to support extension of their batch resources into the cloud, that issue will also come

        [Dirk Hufnagel] 11:42:40
up. I mean, if they want to do separate accounting, that is.
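The sub-site bookkeeping Dirk describes amounts to keying usage on a second identifier below the site name, so HPC usage can be split out from the Tier-1 umbrella. The record shapes and names below are invented for illustration:

    # Split usage below the site level using a sub-site identifier.
    from collections import defaultdict

    records = [
        {"site": "T1_IT_CNAF", "subsite": "tier1-farm", "cpu_hours": 1000.0},
        {"site": "T1_IT_CNAF", "subsite": "marconi100", "cpu_hours": 400.0},
    ]

    usage = defaultdict(float)
    for r in records:
        usage[(r["site"], r["subsite"])] += r["cpu_hours"]

    for (site, subsite), hours in usage.items():
        print(f"{site}/{subsite}: {hours} CPU-hours")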

        [Enrico Fermi Institute] 11:42:46
        Okay.

        [Dirk Hufnagel] 11:42:47
We don't have anyone from OSG here, so I don't think we can get any comment on that.

        [Enrico Fermi Institute] 11:42:53
        And

        [Enrico Fermi Institute] 11:43:00
Other comments about accounting? Otherwise we move on.

         

        PLEDGING

         

        [Dirk Hufnagel] 11:43:14
Okay. Now we actually have something about pledging. And we already talked about that in the last 2 days, and I don't want to rehash that discussion here. We talked about the difference between AC and DC,

        [Enrico Fermi Institute] 11:43:18
        Yes.

        [Dirk Hufnagel] 11:43:31
and planning, like integrated capacity and instantaneous capacity, and the problems related to that, and how the scheduling of HPC and cloud impacts that. Also, one thing to note is that HPC and cloud resources,

        [Dirk Hufnagel] 11:43:54
they're not official OSG and EGI sites, so we don't

        [Dirk Hufnagel] 11:43:58
get GGUS tickets. We talked about the

        [Dirk Hufnagel] 11:44:02
GGUS tickets already. And we talked about the cost to support these resources: you need to set up some unit that supports them.

        [Enrico Fermi Institute] 11:44:06
        Okay.

        [Dirk Hufnagel] 11:44:10
For instance, CMS has that team at Fermilab. So anytime there's a problem at the US

        [Dirk Hufnagel] 11:44:16
HPC sites, we get a ticket for that. And the question here is, where do we see this going?

        [Dirk Hufnagel] 11:44:22
Like, maybe not next year, maybe not even in 2 years,

        [Dirk Hufnagel] 11:44:27
but where do we want to be in 5 years?

        [Dirk Hufnagel] 11:44:32
Let's say, just before HL-LHC

        [Dirk Hufnagel] 11:44:35
starts up, what's the goal here? And then we can...

        [Dirk Hufnagel] 11:44:40
This requires discussion with WLCG.

        [Douglas Benjamin] 11:44:48
In both cases we don't own the clouds or the HPCs, right?

        [Dirk Hufnagel] 11:44:54
Yeah, we don't own them, but we basically, what,

        [Douglas Benjamin] 11:44:59
        Therefore we are customers of them.

        [Dirk Hufnagel] 11:45:02
we are leasing them? I mean, in some sense it doesn't matter who actually owns the hardware.

        [Enrico Fermi Institute] 11:45:05
        Okay.

        [Dirk Hufnagel] 11:45:08
        It matters that you get guaranteed access in some way

        [Douglas Benjamin] 11:45:13
        Not necessarily.

        [Douglas Benjamin] 11:45:17
        Right. We are a customer

        [Dirk Hufnagel] 11:45:20
We are customers, correct.

        [Douglas Benjamin] 11:45:22
So we have to deal with the interface layer that our community provides, i.e.,

        [Dirk Hufnagel] 11:45:27
But that goes into support: since we are the customer, we have to be the middleman for supporting the resources.

        [Douglas Benjamin] 11:45:27
GGUS.

        [Douglas Benjamin] 11:45:38
And then the pledging, right? But the pledging then comes from money that we get to provide the compute

        [Dirk Hufnagel] 11:45:38
So we are the interface to the experiment.

        [Douglas Benjamin] 11:45:47
that we have to provide to Atlas.

        [Dirk Hufnagel] 11:45:49
Yeah, the fundamental difference is that the entity that owns the resources doesn't have a relationship with WLCG;

        [Dirk Hufnagel] 11:45:59
they basically have only a relationship with us, and then we have a relationship with WLCG.


        [Douglas Benjamin] 11:46:09
And do you expect that to change in 5 years? I don't.

        [Douglas Benjamin] 11:46:15
Both clouds and HPCs serve different

        [Douglas Benjamin] 11:46:18
masters; the sites serve different masters. You know, the HPCs in the US

        [Douglas Benjamin] 11:46:26
are responsible to NSF and DOE.

        [Douglas Benjamin] 11:46:33
        Right. Am I missing something?

        [Steven Timm] 11:46:34
        Okay? Awesome.

        [Dirk Hufnagel] 11:46:35
No. But why do we need to care? Because if we get an allocation for, like, a 100 million hours, that's something we can use.

        [Dirk Hufnagel] 11:46:45
We have no guarantees when we can use it,

        [Dirk Hufnagel] 11:46:48
but we get something; we'll have this over a period of time.

        [Steven Timm] 11:46:50
        And

        [Steven Timm] 11:46:56
And at least in the US, we know from the funding this is the way we're going.

        [Steven Timm] 11:47:00
They're not going to be funding the level of lab-owned computers that they have, right?

        [Steven Timm] 11:47:06
So we're ready to evolve beyond resources that they own and operate on a regular basis,

        [Steven Timm] 11:47:13
to the ones that they don't own. I mean, smaller experiments have been doing this forever, just running basically everywhere.


        [Dirk Hufnagel] 11:47:27
Oh, Paolo! You had a comment on this discussion?

        [Paolo Calafiura (he)] 11:47:29
Yes, just to say that I don't think it's particularly productive to discuss what, you know, the agencies want to support and what the agencies' long-

        [Paolo Calafiura (he)] 11:47:46
term plan is. We need to be ready for what they are telling us now, which is that they want us to use these kinds of services. But that wasn't the reason I raised my hand; I wanted to bring up

        [Paolo Calafiura (he)] 11:48:00
another angle, and I don't know if this is just an Atlas thing, or if you guys have a similar concept.

        [Paolo Calafiura (he)] 11:48:05
And this has to do with this capacity versus power that we've been discussing.

        [Paolo Calafiura (he)] 11:48:15
So in Atlas we have the concept of pledged and beyond-pledge resources, and I don't even want to try to tell you why we have this distinction; it is historical.

        [Paolo Calafiura (he)] 11:48:31
But the reality is that the pledge in Atlas is sufficient to process data, to produce a minimum of simulation, and probably to analyze data. But we do rely on a very substantial amount of beyond-

        [Paolo Calafiura (he)] 11:48:54
pledge resources, which is taken into account, is measured, and in the end is not technically pledged. So there is no distinction in my mind

        [Paolo Calafiura (he)] 11:49:07
between a Tier-2 delivering twice as many resources as they are supposed to, and an HPC

        [Paolo Calafiura (he)] 11:49:16
delivering the same resources. So the question is: is this concept of pledge absolutely fundamental?

        [Paolo Calafiura (he)] 11:49:27
Or is it something which is, you know, historical, that we're stuck with and deal with? And then we treat the HPCs and, you know, any resource which can deliver resources not on a constant basis,

        [Paolo Calafiura (he)] 11:49:42
but in an opportunistic way, just as beyond pledge.

        [Dirk Hufnagel] 11:49:50
I mean, we're doing this now; that's how we're treating HPC now. The question is, going forward,

        [Dirk Hufnagel] 11:49:58
if we manage, let's say, to get the LCFs

        [Dirk Hufnagel] 11:50:04
working, and we can run really super large scale, so we've figured it out: is this going to be okay?

        [Dirk Hufnagel] 11:50:15
Because at that point it's going to be,

        [Dirk Hufnagel] 11:50:18
might be, a much larger fraction of overall resources. Is this beyond-pledge model still working then, in this case?

        [Paolo Calafiura (he)] 11:50:22
        Hmm.

        [Paolo Calafiura (he)] 11:50:29
I would say, we'll cross that bridge when we come to it.

        [Paolo Calafiura (he)] 11:50:33
But you know, in the end, I guess what I'm saying is that I'm not sure it is a particularly important distinction right now,

        [Paolo Calafiura (he)] 11:50:44
whether resources are pledged or not pledged. That's, I guess, what I'm trying to say.

        [Paolo Calafiura (he)] 11:50:52
And yeah, you are right: if we end up with 75% of our resources being non-pledged, then it's weird.

        [Paolo Calafiura (he)] 11:51:00
Yes, what we have is imperfect.

        [Dirk Hufnagel] 11:51:01
I mean, Brian also made a good argument on Monday: so far we're looking at this from our viewpoint, but at some point it might become a problem for the agencies, because they want credit for it. So that could become an issue.

        [Paolo Calafiura (he)] 11:51:21
Well, we do; I'm sure you do the same. At least in Atlas

        [Paolo Calafiura (he)] 11:51:24
we do acknowledge that contribution.

        [Dirk Hufnagel] 11:51:26
Yes, we report it. But as far as WLCG

        [Dirk Hufnagel] 11:51:31
is concerned, they're, I don't know, second-class resources, so I don't know how much they matter to the agency.

        [Paolo Calafiura (he)] 11:51:37
Well, I think, as far as the agencies are concerned, it's part of what WLCG

        [Paolo Calafiura (he)] 11:51:42
does and monitors.

        [Paolo Calafiura (he)] 11:51:47
        yes.

        [Dirk Hufnagel] 11:51:58
What about the computing plans?

        [Douglas Benjamin] 11:52:01
Can I ask the question another way? In the next 5 years,

        [Douglas Benjamin] 11:52:04
will we need these to meet our pledge, given

        [Douglas Benjamin] 11:52:10
our current, sort of, flat funding?

        [Douglas Benjamin] 11:52:17
        Because you said a 5 year timeline

        [Dirk Hufnagel] 11:52:20
Yeah, I had a similar thought, a question, basically, because pledge means something in terms of what you can plan.

        [Dirk Hufnagel] 11:52:30
Right? I mean, you base your planning on pledge, and beyond pledge is something you could add extra.

        [Enrico Fermi Institute] 11:52:30
        Okay.

        [Dirk Hufnagel] 11:52:37
        So

        [Dirk Hufnagel] 11:52:41
If that extra becomes required for what you need to do as a baseline, doesn't it need to be included in the pledge?

        [Enrico Fermi Institute] 11:53:02
        okay.

        [Dirk Hufnagel] 11:53:11
        I guess no one has an answer for that.

        [Enrico Fermi Institute] 11:53:19
Just need more time to think about it.

        [Dirk Hufnagel] 11:53:21
Yeah, I mean, this is future.

        [Dirk Hufnagel] 11:53:26
I think these are questions that should go into the report,

        [Dirk Hufnagel] 11:53:30
but it's anyway outside the scope of this workshop, and maybe even this report in general, these kinds of discussions on this topic.

        [Douglas Benjamin] 11:53:42
But how much labor do we want for beyond-pledge activity? How much labor is acceptable versus excessive?

        [Douglas Benjamin] 11:53:57
In other words, if it takes 3 FTEs to do 3% of the Monte Carlo Atlas needs as the US contribution, then you might consider that excessive.

        [Dirk Hufnagel] 11:54:10
Oli, you wanna weigh in on this?

        [Oliver Gutsche] 11:54:13
Well, let me try. So I think the agencies are also trying to optimize their budget,

        [Oliver Gutsche] 11:54:24
right? So in the end the agencies need to enable us to do our science.

        [Oliver Gutsche] 11:54:30
So if the agencies have the possibility to say, okay, instead of giving you all the money for all your sites, some of your processing will come from reliable allocations on HPC, then the question is: do they fulfill the requirements to be

        [Oliver Gutsche] 11:54:52
acknowledged as an official contribution to the experiment? And then it becomes a cost question, right, as you asked: how many FTEs is reasonable?

        [Oliver Gutsche] 11:55:05
It depends, then, on how much money, how much funds, you would actually save by this approach.

        [Oliver Gutsche] 11:55:12
So I think the question that we might have to answer is: how much does it cost us to pledge HPC

        [Oliver Gutsche] 11:55:23
resources? And that then goes into the calculation, for us

        [Oliver Gutsche] 11:55:30
and the agencies, of how useful and how efficient it is to actually pledge HPC resources for our purposes.

        [Enrico Fermi Institute] 11:55:32
        Okay.

        [Oliver Gutsche] 11:55:39
So that would be a very interesting assessment for us.

        [Enrico Fermi Institute] 11:55:45
        Well, and and all of the

        [Oliver Gutsche] 11:55:46
I don't know if I make sense... Yeah, sure. Sorry. Go ahead.

        [Enrico Fermi Institute] 11:55:50
No, I'm just gonna say that the other half of that is, and you're right that we will have to answer the question at some point, how much does it cost to provide X resources from HPCs, but then we also need to be ready for the immediate other question, which

        [Enrico Fermi Institute] 11:56:03
is going to be: how much does it cost for us to provide those same resources on premises?

        [Enrico Fermi Institute] 11:56:09
        I guess right.

        [Oliver Gutsche] 11:56:11
Yeah. But I mean, for the latter question, we have 15 years of experience doing that, right?

        [Oliver Gutsche] 11:56:21
So for me, for this exercise,

        [Oliver Gutsche] 11:56:26
it's really about: if we want to have HPC resources or commercial cloud resources replace pledges that we normally would provide through our sites,

        [Oliver Gutsche] 11:56:37
what would it actually cost?

        [Paolo Calafiura (he)] 11:56:51
One thing which I guess I can add to this, and to clarify what I was saying before: I think that what we need right now, and that's why the new benchmarking is so important,

        [Paolo Calafiura (he)] 11:57:06
what we need right now is a reliable way to do accounting.

        [Paolo Calafiura (he)] 11:57:13
And so, for example, we can then answer the question Doug was asking: is it worth having

        [Paolo Calafiura (he)] 11:57:18
3 FTEs to get 3% of the overall resources? I don't want to use the words pledged, non-pledged; let's forget about that.

        [Enrico Fermi Institute] 11:57:27
        okay.

        [Paolo Calafiura (he)] 11:57:28
If a site is giving me 3% of the resources, how much effort, and therefore money, do I need to put into it?

        [Paolo Calafiura (he)] 11:57:38
Is it worth it? So I think the problem is that right now we do not...

        [Paolo Calafiura (he)] 11:57:46
Well, there's also the problem that

        [Paolo Calafiura (he)] 11:57:52
we do not have basically any workflows running on LCFs, because we don't have any accelerated workloads really in production.

        [Paolo Calafiura (he)] 11:57:57
But assuming we do add them, then we need that.

        [Paolo Calafiura (he)] 11:58:01
Then we need a good way to measure what the contribution is.

        [Paolo Calafiura (he)] 11:58:08
It could be the number of events, but then, of what kind of events?

        [Paolo Calafiura (he)] 11:58:15
Is it full simulation, fast simulation, reconstruction?

        [Paolo Calafiura (he)] 11:58:19
So for me the critical question is the accounting, not the pledging.

        [Paolo Calafiura (he)] 11:58:24
Okay, just to reformulate what I was saying before.

        [Enrico Fermi Institute] 11:58:40
        okay.

        [Dirk Hufnagel] 11:58:46
Okay, do we wanna move on? I mean, we have many questions, but since it concerns the future, that's expected.

      • 10:30
        Security Topics (WLCG, DOE) 30m

        Authentication, Identity Management on Cloud and HPC
        CA Certificate Issues
        Google/Amazon CAs aren’t trusted by IGTF

        [Eastern Time]

         

        Security Topics and Discussion

         

        Security topics bullet

        [Enrico Fermi Institute] 11:58:49
        Yeah.

        [Enrico Fermi Institute] 11:58:55
Yeah, so the next slide was just a placeholder for that.

        [Dirk Hufnagel] 11:59:02
And then another question from the charge was: what new facility features or policies would help US

        [Dirk Hufnagel] 11:59:09
Atlas and US CMS adopt commercial cloud and HPC

        [Dirk Hufnagel] 11:59:13
resources? One thing that we had in here was security. We invited some security folks, but I don't think anyone actually managed to connect.

        [Dirk Hufnagel] 11:59:24
So that's a little bit of a pity. But one big problem:

        [Dirk Hufnagel] 11:59:31
apart from the LCF restriction with no outbound Internet from the worker nodes, most of the HPCs these days also have some sort of MFA login procedure, so you cannot really connect from the outside to these systems without going through some

        [Enrico Fermi Institute] 11:59:37
        Yeah.

        [Dirk Hufnagel] 11:59:52
MFA process, and that usually would mean that we cannot really integrate things into automated provisioning systems and things like that.

        [Dirk Hufnagel] 12:00:01
But some HPCs are a bit more flexible in

        [Dirk Hufnagel] 12:00:06
what they allow MFA to mean. Like, at LCFs

        [Enrico Fermi Institute] 12:00:08
        Yeah.

        [Dirk Hufnagel] 12:00:12
it's basically strict hardware tokens or phone apps, so you can't do anything: any kind of outside connection goes through that step.

        [Dirk Hufnagel] 12:00:20
So it cannot be automated. The NSF-funded HPCs,

        [Dirk Hufnagel] 12:00:26
so far at least, are more forgiving. They can say, okay,

        [Dirk Hufnagel] 12:00:30
MFA can mean that the system that logs in remotely comes from a certain IP, or things like that, or they allow MFA to be bypassed, at the moment still, in general, as a policy question. And then, Fernando, you want to say something on

        [Dirk Hufnagel] 12:00:47
the cloud issues with the CAs?

        [Fernando Harald Barreiro Megino] 12:00:51
Yeah. So the problem is that Google and Amazon,

        [Fernando Harald Barreiro Megino] 12:01:01
they use their own certificate authorities, and those are not trusted by IGTF.

        [Fernando Harald Barreiro Megino] 12:01:11
And in particular, if you want to do a third-party transfer,

        [Fernando Harald Barreiro Megino] 12:01:15
the cloud CA is not trusted, and then the transfer fails, and you need to do something:

        [Fernando Harald Barreiro Megino] 12:01:23
you put something in front of the storage with another certificate, and this

        [Fernando Harald Barreiro Megino] 12:01:34
can become a bottleneck. It would be preferable if the third-party transfers would work. This is being discussed already in the WLCG,

        [Fernando Harald Barreiro Megino] 12:01:47
but I heard that probably there will not be a solution in the next

        [Fernando Harald Barreiro Megino] 12:01:55
years; this is a mid- to long-term problem.

        [Enrico Fermi Institute] 12:02:02
So is the issue that the WLCG

        [Enrico Fermi Institute] 12:02:05
would need to accept the certificate authorities of the commercial cloud providers?

        [Fernando Harald Barreiro Megino] 12:02:12
Yes, they would have to become part of the IGTF, and I don't know exactly what the policies are to get into the IGTF, and I understand it also requires some effort from the cloud vendor. For them,

        [Fernando Harald Barreiro Megino] 12:02:31
maybe it's not worth it. So that's why

        [Fernando Harald Barreiro Megino] 12:02:37
there is not really a solution in the short term, as far as I understand.

        [Enrico Fermi Institute] 12:02:43
I'll just point out that the IGTF is bigger than just the WLCG.

        [Enrico Fermi Institute] 12:02:47
So it's not necessarily WLCG that has to; it's a step above them that we would have to convince to do it.
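To illustrate the trust failure being described, here is a minimal sketch (the endpoint URL is just an example; the CA directory is the conventional, OpenSSL-hashed grid location): a client configured to trust only the IGTF bundle cannot verify a cloud object store's certificate, which is the same reason a third-party transfer hop rejects the endpoint.

    import requests

    # Typical grid CA directory holding the IGTF-distributed CAs.
    IGTF_CA_DIR = "/etc/grid-security/certificates"

    try:
        # A transfer tool that only trusts IGTF CAs does the equivalent of:
        requests.get("https://storage.googleapis.com/",
                     verify=IGTF_CA_DIR, timeout=10)
    except requests.exceptions.SSLError as exc:
        # Google's CA is not in the IGTF bundle, so verification fails here.
        print("certificate verification failed:", exc)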

        [Dirk Hufnagel] 12:03:02
Oh, and I skipped the federated identity point, because I think without a security person from the labs here there's no point discussing that. I just want to mention what it is before I go to Eric: basically, the labs have been working on kind of

        [Dirk Hufnagel] 12:03:24
federating their systems in terms of logins, and so on,

        [Dirk Hufnagel] 12:03:28
but I'm not sure if this will help, because as far as I know MFA is still required on top of that. So I might be able to log into Argonne with my Fermilab ID, but I still would have to go through the MFA step, as far as I know. I just don't

        [Dirk Hufnagel] 12:03:44
know if that's something that eventually could be dropped with, maybe, private networks between the national labs.

        [Dirk Hufnagel] 12:03:51
But I wanted to get feedback from IT

        [Dirk Hufnagel] 12:03:54
security; I guess we have to do that offline. Eric?

        [Eric Lancon] 12:03:58
Yes, I wanted to comment; I have 2 points. So when it comes to MFA identification and cloud CAs, that's a price to pay when you don't own the resources.

        [Eric Lancon] 12:04:16
So instead of complaining, we should find innovative solutions whenever it's possible, to make better use of those resources. On the federated ID:

        [Eric Lancon] 12:04:30
I invited the IT people online for this meeting.

        [Eric Lancon] 12:04:35
So, if there are no Fermilab people, we can have the BNL perspective; Jerome is on the line. I wanted to remind everyone that MFA will certainly become a standard everywhere,

        [Eric Lancon] 12:04:56
well, on a 5-year time scale, so we should adapt to it and foresee that we'll have to work with it.

        [Eric Lancon] 12:05:06
So, on federated ID, if you want to say a few words...

        [Jerome Lauret (he/him)] 12:05:10
Sure, let me right away jump in on the MFA topic and close

        [Jerome Lauret (he/him)] 12:05:19
this. In general, any services that we have put in place that are cloud-inspired, or provide access for a wide variety of people from multiple different organizations,

        [Jerome Lauret (he/him)] 12:05:35
require MFA; there's been no escape from it. So far we have been able to set up,

        [Jerome Lauret (he/him)] 12:05:43
of course, you know, a few services, like JupyterHub and things like that,

        [Jerome Lauret (he/him)] 12:05:47
but MFA has been essentially the prerequisite.

        [Jerome Lauret (he/him)] 12:05:53
The other comment, on federated ID, is that of course it all depends on the federation, and we have been basically allowed to proceed with many of the trusted federations,

        [Jerome Lauret (he/him)] 12:06:10
that is, what DOE sees as trusted. So that is, for example, people coming from all the national labs;

        [Jerome Lauret (he/him)] 12:06:18
we have an exemption for certain federations as well.

        [Jerome Lauret (he/him)] 12:06:23
But in general, we have had a consistent message that Google, for example, is not a trusted and acceptable federation, and the reason for that is just that anybody, anytime, can impersonate anyone. And indeed, on our side we saw some kind of funny identities. So

        [Jerome Lauret (he/him)] 12:06:47
that's all I wanted to say. But of course, as Paolo was already saying, the fact that we are being told,

        [Jerome Lauret (he/him)] 12:06:55
okay, please proceed with federated ID, is very encouraging. But this is a long road. And I think that early on someone mentioned also that, you know, MFA

        [Jerome Lauret (he/him)] 12:07:09
or SFA could be bypassed in some ways;

        [Jerome Lauret (he/him)] 12:07:10
yes, that's true, and there is a lot of work here

        [Enrico Fermi Institute] 12:07:12
        Okay.

        [Jerome Lauret (he/him)] 12:07:13
to also add trusted metadata, you know, as part of the certificate. But this is indeed not an immediate development.

        [Jerome Lauret (he/him)] 12:07:24
So that's why, perhaps, right now we prefer to have only trusted federations, just to be sure that, you know, everybody has essentially the same kind of rules of engagement.

        [Enrico Fermi Institute] 12:07:40
So one question I had is, you know, from a security perspective:

        [Enrico Fermi Institute] 12:07:45
is it the case that MFA is fundamentally analogous to saying that you want to have some human interaction to authenticate with a resource? Or, you know, does the MFA

        [Enrico Fermi Institute] 12:08:04
just mean that you really just need these multiple factors, right?

        [Enrico Fermi Institute] 12:08:09
It's not sufficient just to have a key or password or whatever; you need to have some additional factor.

        [Jerome Lauret (he/him)] 12:08:15
Good question. It's simple, and in fact, you know, in some cases we're even required to have a secret handshake,

        [Jerome Lauret (he/him)] 12:08:27
I mean, essentially, to know the point of contact from within the experiment that you are in, in order to have an account and be approved. Because the level of confidence depends, of course, on the service that you access. Just to be sure that you see the difference:

        [Enrico Fermi Institute] 12:08:30
        And

        [Enrico Fermi Institute] 12:08:35
        Sure.

        [Enrico Fermi Institute] 12:08:41
        Sure.

        [Jerome Lauret (he/him)] 12:08:46
if you access Mattermost, for example, even your federated ID is enough.

        [Jerome Lauret (he/him)] 12:08:53
If you are issued an account, something that essentially allows you to make modifications to the content,

        [Jerome Lauret (he/him)] 12:09:02
then MFA is required. And if you do access computing resources right now, things where you can eventually launch a large number of jobs, which could appear as being some kind of illegal activity, then not only is MFA required, but in order to have your account

        [Jerome Lauret (he/him)] 12:09:20
approved you need some extra steps and verification. We came up with a procedure that was actually acceptable to the cybersecurity team,

        [Jerome Lauret (he/him)] 12:09:29
so, you know, there's some kind of scale of acceptance.

        [Enrico Fermi Institute] 12:09:35
So I guess, then, is it fair to say that MFA

        [Enrico Fermi Institute] 12:09:40
does not, fundamentally, you know, exclude automation,

        [Jerome Lauret (he/him)] 12:09:47
        I would say

        [Enrico Fermi Institute] 12:09:47
given what, you know, for example, TACC has done, where they consider, you know, a trusted machine to be a factor on top of the key that you provide?

        [Jerome Lauret (he/him)] 12:10:01
Right. So actually, this is an excellent question, because, of course, for example, job submission from a trusted host,

        [Jerome Lauret (he/him)] 12:10:11
for example in OSG land, has been, of course, accepted, right? So

        [Enrico Fermi Institute] 12:10:16
        Okay.

        [Jerome Lauret (he/him)] 12:10:17
that indeed is somewhat what you are hinting at: that host is trusted, and, you know, to access that host to submit,

        [Jerome Lauret (he/him)] 12:10:30
you then have additional, you know, authentication.

        [Jerome Lauret (he/him)] 12:10:35
You understand what I'm saying, right? I mean, you log in, for example, to that host using your local credential,

        [Jerome Lauret (he/him)] 12:10:40
then you eventually issue a token or whatever, which is yet a second factor, and then use it to submit your job. And that has been accepted for quite a while.

        [Jerome Lauret (he/him)] 12:10:51
So you are right that there may be some leeway there, some room in that sense.

        [Jerome Lauret (he/him)] 12:10:58
        Okay.

        [Enrico Fermi Institute] 12:11:00
        I think

        [Enrico Fermi Institute] 12:11:09
Other comments or questions about MFA or cloud CAs, IGTF, that sort of thing?

        [Dale Carder] 12:11:13
I know at NERSC there's a process to get long-term keys instead of, like, the default

        [Dale Carder] 12:11:18
24-hour key for SSH proxy.

        [Dirk Hufnagel] 12:11:21
Yeah, it's up to a month, I think, that they support.

        [Dale Carder] 12:11:24
        Yeah.

        [Dirk Hufnagel] 12:11:27
But to get that key you do need to go through an MFA process, and then you're okay

        [Enrico Fermi Institute] 12:11:34
        Sure.

        [Dirk Hufnagel] 12:11:34
for 30 days. That's basically the compromise between:

        [Dirk Hufnagel] 12:11:37
we don't want to allow automation where no one authenticates for a couple of years, and then,

        [Dirk Hufnagel] 12:11:42
if that key gets compromised, basically anyone can use it forever. And at least you don't have to do this daily, so it's at least operationally feasible

        [Dirk Hufnagel] 12:11:53
to use the system, even with the MFA rules

        [Dirk Hufnagel] 12:11:57
in place.

        [Dirk Hufnagel] 12:12:01
And, I mean, Globus Online is the same. That's what we're doing with the Rucio Globus Online

        [Dirk Hufnagel] 12:12:06
integration for the transfers: someone actually has to log in manually to the portal and renew a key once a week so that the transfers can keep going. But, I mean, once a week, once a month, that just means you roll it into cost;

        [Enrico Fermi Institute] 12:12:15
        Right.

        [Dirk Hufnagel] 12:12:23
it's a bump on the cost of operations for the long-term

        [Dirk Hufnagel] 12:12:26
operation.
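A minimal sketch of how such a monthly renewal can be kept operationally cheap (assuming a NERSC-style ~30-day key at a hypothetical path; this is not a NERSC-provided tool): a cron job warns operators a few days before the key expires, so the manual MFA step becomes a routine, scheduled task rather than a surprise outage.

    import os
    import time

    KEY_PATH = os.path.expanduser("~/.ssh/nersc")  # placeholder path
    MAX_AGE_DAYS = 30   # lifetime granted after the MFA login
    WARN_AT_DAYS = 27   # start nagging a few days before expiry

    age_days = (time.time() - os.path.getmtime(KEY_PATH)) / 86400
    if age_days > WARN_AT_DAYS:
        print(f"key is {age_days:.1f} days old; "
              f"renew via MFA before day {MAX_AGE_DAYS}")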

        [Jerome Lauret (he/him)] 12:12:29
Usually, when you have those long-term credentials, you also have to demonstrate that you have a way to revoke them first.

        [Dirk Hufnagel] 12:12:41
I don't know how they handle it. You would probably have to go through NERSC, because I don't think you can revoke it yourself.

        [Jerome Lauret (he/him)] 12:12:48
Yeah, exactly. So this may be a concern in the long term.

        [Jerome Lauret (he/him)] 12:12:52
Yes, I'm just saying, in terms of visibility: people may not know the details, but usually that's one of the things that comes up when long-lived credentials appear.

        [Robert Hancock] 12:13:05
Yeah, and in our plan, with Vault, right? The long-standing credentials would stay in the Vault

        [Robert Hancock] 12:13:11
server, so we could just delete them from there, and then they wouldn't be able to pull any more short-term credentials, you know,

        [Robert Hancock] 12:13:15
short-term tokens, like access tokens.

        [Enrico Fermi Institute] 12:13:27
        Okay.
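A rough sketch of the Vault idea just described, using the hvac Python client for HashiCorp Vault (the URL, operator token, and secret path are placeholders, and the real experiment token flow has more pieces than this): the long-lived credential lives only server-side, so revocation is just deleting it from Vault.

    import hvac

    client = hvac.Client(url="https://vault.example.org:8200")  # placeholder
    client.token = "s.xxxx"  # operator credential, placeholder

    # Store the long-lived refresh credential server-side once:
    client.secrets.kv.v2.create_or_update_secret(
        path="transfers/refresh-token",
        secret={"token": "<long-lived-token>"},
    )

    # Revocation: remove it from Vault; clients can no longer mint
    # short-lived access tokens from it.
    client.secrets.kv.v2.delete_metadata_and_all_versions(
        path="transfers/refresh-token"
    )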

        [Dirk Hufnagel] 12:13:31
If you don't have any more comments on security topics, we could move on to the allocations; that's more like acquiring the resources.

         

        ALLOCATIONS

        [Enrico Fermi Institute] 12:13:42
        Okay.

        [Dirk Hufnagel] 12:13:45
So on HPC you do it through allocations; it's currently yearly

        [Dirk Hufnagel] 12:13:48
allocations. It was mentioned already in the HPC focus area discussions:

        [Dirk Hufnagel] 12:13:55
if you had multi-year allocations, that would (a)

        [Enrico Fermi Institute] 12:13:55
        Okay.

        [Dirk Hufnagel] 12:14:00
reduce the effort to acquire the HPC resources, because you wouldn't have to constantly re-justify every year, and (b)

        [Dirk Hufnagel] 12:14:09
it would also open up possibilities to include sizable HPC

        [Dirk Hufnagel] 12:14:15
allocations in the planning process, which you can't do right now, because at the moment you write the proposal, you get the decision, and then, usually on the order of a few months later, you actually have the resources. And you don't actually get the decision until a few

        [Dirk Hufnagel] 12:14:37
months before, which is too late to include it in the long-term planning process for resource use in the experiments. And that's, independently of any kind of pledging problems we have,

        [Dirk Hufnagel] 12:14:55
a problem in being able to pledge.

        [Dirk Hufnagel] 12:14:58
I mean, if we don't know that we have the resources, we cannot pledge them, even if there were procedures in place to be able to do so technically. And then, what was mentioned also many times before, is that large storage allocations with connectivity to the wide area network would

        [Dirk Hufnagel] 12:15:14
simplify HPC operations; it would basically make some things possible

        [Dirk Hufnagel] 12:15:23
        that might not be possible now, and it definitely would reduce the cost

        [Enrico Fermi Institute] 12:15:29
        Yeah, okay.

        [Dirk Hufnagel] 12:15:30
And on the cloud side, Fernando, you want to say something on the subscription model?

        [Fernando Harald Barreiro Megino] 12:15:37
Well, I mean, I'm not sure exactly how that works.

        [Fernando Harald Barreiro Megino] 12:15:44
In the end, the cloud vendor is fine as long as they get the check, the subscription is renewed, and there is a common understanding about what the cost of the subscription is going to be.

        [Fernando Harald Barreiro Megino] 12:16:02
But then I don't know how, in Atlas, the budgeting works, whether they prepare a yearly budget for that.

        [Dirk Hufnagel] 12:16:18
But I'm really curious about what will happen after the, what is it,

        [Dirk Hufnagel] 12:16:24
15 months.

        [Fernando Harald Barreiro Megino] 12:16:25
        Yes, it's around October 2023.

        [Fernando Harald Barreiro Megino] 12:16:29
        That's fine.

        [Dirk Hufnagel] 12:16:30
We'll see. I mean, I really would like to see what happens,

        [Dirk Hufnagel] 12:16:32
and if they just renew it at the same level, or if they actually drill into the billing data that they collect and do some analysis. I mean, it depends, I guess, on the billing data.

        [Enrico Fermi Institute] 12:16:40
        Okay.

        [Dirk Hufnagel] 12:16:46
But still, I'm curious.

         

        HPC - CVMFS & Rucio compatible storage

        [Dirk Hufnagel] 12:16:52
And then some specific topics here on the HPC

        [Dirk Hufnagel] 12:16:56
side. We mentioned facilitating CVMFS access;

        [Dirk Hufnagel] 12:17:01
I think this is more or less a solved problem, because cvmfsexec, as Brian said, is considered kind of stable, and the solution to provide

        [Enrico Fermi Institute] 12:17:03
        Okay.

        [Dirk Hufnagel] 12:17:12
CVMFS these days. And that basically immediately gets you to the second problem: you need to either have some squid infrastructure in place, or the ability to launch our own, because that's what cvmfsexec supports. And then there's Frontier on top of it. But

        [Dirk Hufnagel] 12:17:33
first of all, access at facilities, access to software. Oh, Steven has a comment.

        [John Steven De Stefano Jr] 12:17:42
Yeah, I was just wondering, in general, on the HPC side, when it comes to CVMFS access,

        [John Steven De Stefano Jr] 12:17:48
what the main issue is with the native client. Is it just connectivity that's restricted on the...

        [Dirk Hufnagel] 12:17:54
        It's usually that they don't want to install custom software for just one customer

        [John Steven De Stefano Jr] 12:17:59
But they'll use cvmfsexec?

        [Dirk Hufnagel] 12:18:02
Well, that runs completely in user space, the latest versions. It's becoming increasingly easy. I mean,

        [John Steven De Stefano Jr] 12:18:05
        True

        [Dirk Hufnagel] 12:18:10
we started using it like 2 years ago, and it's becoming increasingly easy to use because the newer machines run new operating systems with newer kernel features.

        [Enrico Fermi Institute] 12:18:16
        Yeah.

        [Dirk Hufnagel] 12:18:21
And basically, at this level you can run it completely in user space.

        [Dirk Hufnagel] 12:18:27
The system dependencies are so small these days, if the kernel is new enough, that it kind of just works.

        [Enrico Fermi Institute] 12:18:36
In the past at least, you know, justified or not,

        [Enrico Fermi Institute] 12:18:39
there was certainly some paranoia I have seen about, you know, running FUSE file systems on a compute node;

        [Enrico Fermi Institute] 12:18:48
some sites were worried about that.

        [Dirk Hufnagel] 12:18:53
Yeah, some sites; at some HPC sites you log into a batch node and, like, fusermount is not available.

        [Dirk Hufnagel] 12:19:00
But that's not a problem: if the kernel is new enough,

        [Dirk Hufnagel] 12:19:02
cvmfsexec doesn't need the fusermount binary to do a FUSE mount;

        [Dirk Hufnagel] 12:19:07
you can do it directly through unprivileged user namespaces. Yeah.

        [John Steven De Stefano Jr] 12:19:11
Sure, and I understand the concern about FUSE being another layer on top of an already complicated or complex system.

        [John Steven De Stefano Jr] 12:19:18
        But I think the native client has proven fairly stable lately, so I understand the concerns.

        [Enrico Fermi Institute] 12:19:24
So it's just convincing the sites

        [Enrico Fermi Institute] 12:19:26
that that's the case. Okay.

        [John Steven De Stefano Jr] 12:19:27
        Thanks.

        [John Steven De Stefano Jr] 12:19:27
        Yeah.
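As a sketch of how a pilot or wrapper can use cvmfsexec entirely in user space, per the discussion above (this assumes a cvmfsexec distribution has already been built with its makedist tool and sits in the working directory; repository names and the payload command are illustrative):

    import subprocess

    repos = ["cms.cern.ch", "grid.cern.ch"]  # repositories the payload needs
    payload = ["bash", "-c", "ls /cvmfs/cms.cern.ch | head"]

    # cvmfsexec mounts the repositories in an unprivileged user namespace
    # (no fusermount needed on new-enough kernels) and then runs the
    # command given after the "--" separator.
    subprocess.run(["./cvmfsexec", *repos, "--", *payload], check=True)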

        [Dirk Hufnagel] 12:19:33
And then another area: HPC would be so much simpler if they would provide Rucio-compatible storage. We're currently working with NERSC on that. LCF, I don't think, I mean, there are no efforts

        [Dirk Hufnagel] 12:19:51
there, and I'm not sure if it will ever happen.

        [Dirk Hufnagel] 12:19:54
But at least they support Globus Online, so we do have a Rucio Globus Online integration.

        [Dirk Hufnagel] 12:20:01
So it's doable.

        [Douglas Benjamin] 12:20:05
        So then we just have to make sure that we call out that there's a hop that's required

        [Dirk Hufnagel] 12:20:12
Yeah, I mean, we tried. I don't know if you ever tried

        [Dirk Hufnagel] 12:20:15
the multi-hop. We tried it through NERSC, and it just worked.

        [Dirk Hufnagel] 12:20:21
We have NERSC currently integrated via the still-existing GridFTP

        [Dirk Hufnagel] 12:20:25
integration, which will eventually go away, but it's still there for now. And then both NERSC and ALCF Theta were integrated via Globus Online. And then basically you can

        [Dirk Hufnagel] 12:20:41
configure the Rucio system so that when you put in a rule to create some data at Theta, it automatically first does a GridFTP transfer to NERSC and then immediately a Globus Online transfer from NERSC to Theta (see the sketch below).
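A hedged sketch of that multi-hop setup with the Rucio client API (the RSE names are placeholders, and the exact attribute name and server-side configuration that steer multi-hop may differ by Rucio version): the intermediate endpoint is flagged as usable for multi-hop, and a single rule targeting the LCF storage then gets planned as two chained transfers.

    from rucio.client import Client

    client = Client()

    # Mark the intermediate RSE (the "hop") as usable for multi-hop routing.
    client.add_rse_attribute(rse="NERSC_DTN",
                             key="available_for_multihop", value=True)

    # One rule targeting the LCF storage; the server plans
    # SOURCE -> NERSC_DTN -> THETA_LCF as two chained transfers.
    client.add_replication_rule(
        dids=[{"scope": "cms", "name": "/store/some/dataset"}],  # placeholder
        copies=1,
        rse_expression="THETA_LCF",
    )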

        [Douglas Benjamin] 12:21:02
But will that work when NERSC goes to the next generation of Globus?

        [Dirk Hufnagel] 12:21:07
No, but that's why the work with NERSC on the new interface is important, because that eventually, hopefully, will replace the GridFTP integration, which has been deprecated for many years and will go away.

        [Douglas Benjamin] 12:21:28
So CMS is planning to keep NERSC in the HPC data flow path for essentially NSF

        [Douglas Benjamin] 12:21:39
and other DOE HPCs,

        [Douglas Benjamin] 12:21:44
versus putting Fermilab in the path, so that Fermilab becomes the connector?

        [Dirk Hufnagel] 12:21:45
        Yes, I'm

        [Dirk Hufnagel] 12:21:53
Where exactly we put the multi-hop is still to be decided, and NERSC is an obvious candidate.

        [Dirk Hufnagel] 12:21:58
But also, I think we have T2s with Globus Online licenses,

        [Dirk Hufnagel] 12:22:02
so that would be an alternative option that we have.

        [Douglas Benjamin] 12:22:11
Because Atlas uses, you know, us as the hop.

        [Dirk Hufnagel] 12:22:14
        Yeah.

        [Dirk Hufnagel] 12:22:23
I mean, at the moment, at the level of transfers we need to do to the LCFs,

        [Enrico Fermi Institute] 12:22:29
        See.

        [Dirk Hufnagel] 12:22:30
it's not that important where the hop location is.

        [Dirk Hufnagel] 12:22:33
If we scale up LCF usage, and we're really looking at a future

        [Dirk Hufnagel] 12:22:38
that's heavy on, like, data reconstruction or so, then it becomes a more important question, because that's potentially a lot of traffic you have to multi-hop.

        [Enrico Fermi Institute] 12:22:54
        Okay.
         

        PEERINGS & ANYTHING ELSE

        [Dirk Hufnagel] 12:23:08
And then we had a point here on network traffic and peerings, to improve connectivity and reduce or limit the egress cost.

        [Dirk Hufnagel] 12:23:18
I think we had an interesting presentation from ESnet yesterday about the connectivity side of things.

        [Dirk Hufnagel] 12:23:22
I don't think we got anywhere with the reduced or unlimited egress cost.

        [Dirk Hufnagel] 12:23:26
That's not so much a question of how the networks are connected and how the peering is set up, but more a question of what type of cost model you have: you have a subscription, or you use a cloud that doesn't

        [Dirk Hufnagel] 12:23:42
have egress costs. Yeah, it seems that's the outcome I get out of this workshop.

        [Dirk Hufnagel] 12:23:50
And then there's an open-ended question: what else? Is there anything

        [Dirk Hufnagel] 12:23:54
we forgot to cover here that could help us with our HPC and cloud usage?

        [Enrico Fermi Institute] 12:24:12
And I think that's what we have for this session.

        [Enrico Fermi Institute] 12:24:18
        Call current standards.

        [Enrico Fermi Institute] 12:24:23
Yeah. So, you know, if there are other things that we should talk about, for, you know, facility features and policies, or any of the topics that we covered in previous days, I think it would be a good time to bring them up now;

        [Enrico Fermi Institute] 12:24:44
otherwise we can end the session a little early.

        [Paolo Calafiura (he)] 12:24:54
When do we reconvene? If we...

        [Enrico Fermi Institute] 12:24:57
The next session will be at one o'clock Central time.

        [Paolo Calafiura (he)] 12:25:03
        Okay.

        [Enrico Fermi Institute] 12:25:17
Seeing people disconnect, so maybe we'll just go ahead and close out, and then resume in an hour and a half.

        [Dirk Hufnagel] 12:25:23
        sounds good

        [Enrico Fermi Institute] 12:25:24
        Okay, So it's guessing

        [Paolo Calafiura (he)] 12:25:24
Bye, folks.

        [Fernando Harald Barreiro Megino] 12:25:25
        okay.

      • 11:00
        Discussion 1h
    • 12:00 13:00
      Lunch Break 1h
    • 13:00 15:00
      Third Day Afternoon

      [Eastern Time]

       

      [K-T Lim] 13:54:53
      Hello!

      [Fernando Harald Barreiro Megino] 13:54:57
      We'll start in 5 min or so.

      [K-T Lim] 13:54:59
Yup! No rush!

      [Enrico Fermi Institute] 13:59:30
Let's do what we've been doing previously and try to wait till maybe 5 after before we get started, and I think then we'll jump right into the presentation.

      [Enrico Fermi Institute] 14:01:53
We'll start in just a couple of minutes here.

      [K-T Lim] 14:02:08
I only just figured out how to log in to CERN,

      [K-T Lim] 14:02:22
      So I can upload

      [Enrico Fermi Institute] 14:02:35
      Okay.

      [Enrico Fermi Institute] 14:02:38
Yeah, just let me know, KT, when you're ready to start sharing.

      [Enrico Fermi Institute] 14:02:43
We'll probably wait for people to come into the room. I've posted the sort of relevant charge question here: what can US ATLAS

      [Enrico Fermi Institute] 14:02:53
and US CMS learn from related international efforts, like...

      [K-T Lim] 14:02:59
Yup, so I'm ready to share any time, and I've uploaded a PDF

      [K-T Lim] 14:03:08
of my slides and presenter notes to the Indico page.

      [Enrico Fermi Institute] 14:03:11
      Okay, awesome. Thank you. Yeah. In the morning session we had a few more people.

      [Enrico Fermi Institute] 14:03:16
So let's see if we can get some of those folks back; maybe, like, 2 more minutes.

      [K-T Lim] 14:03:17
      Okay.

      [K-T Lim] 14:03:23
      Sure.

      [Enrico Fermi Institute] 14:03:23
And then, yeah, we can start.

      [K-T Lim] 14:03:29
      It's a little early for lunch here, but

      [Enrico Fermi Institute] 14:03:52
      okay.

      [Enrico Fermi Institute] 14:04:34
Okay, maybe let's go ahead and get started. I'm gonna stop sharing, KT, and then

      [K-T Lim] 14:04:39
      okay, okay.

      [Enrico Fermi Institute] 14:04:39
      You can start sharing your slides

      [K-T Lim] 14:04:52
      Let me see

      [K-T Lim] 14:04:53
      Let me see! Oh, geez

      [Enrico Fermi Institute] 14:05:03
      Okay.

      [K-T Lim] 14:05:07
Hold on, I need to grant access, because apparently I updated something.

      [K-T Lim] 14:05:07
      hold on!

      [K-T Lim] 14:05:25
      okay, I will be back in a second

      [Enrico Fermi Institute] 14:05:27
      okay.

      [K-T Lim] 14:06:01
Hey, hopefully this works better. That looks a lot better.

      [Enrico Fermi Institute] 14:06:08
      Okay, great.

      [K-T Lim] 14:06:08
      okay.

      [K-T Lim] 14:06:16
      Oh, is that good

      [Enrico Fermi Institute] 14:06:17
      Yep.

      [K-T Lim] 14:06:21
Okay, well, let's get started then. Thank you very much for inviting me. Happy to share a little bit about the Rubin Observatory's experience with cloud computing:

      [K-T Lim] 14:06:37
      how we got to where we are, where we think we are, and where we're going.

      [K-T Lim] 14:06:44
Let me just start by saying that usually clouds are very bad for astronomers.

      [K-T Lim] 14:06:49
You can see some clouds on the horizon over La Serena on the left, and then there are plenty of dust clouds in the Milky Way, and all of those block views of things that astronomers like to see. But in this case they're actually pretty good, so

      [K-T Lim] 14:07:02
      we like the way that the cloud is working out for us.

      [K-T Lim] 14:07:07
What is the Rubin Observatory doing? The Rubin Observatory is being built on top of a mountain in Chile in order to perform the Legacy Survey of Space and Time. The survey will scan the sky, taking 20 TB a night of 30-second images

      [K-T Lim] 14:07:23
      that'll cover the entire visible sky every few days.

      [K-T Lim] 14:07:25
This is essentially a movie of the whole sky, or at least the part of the sky that we can see. We have several different data products that are produced on different cadences.

      [K-T Lim] 14:07:36
      So first of all, we have prompt data products that generate primarily alerts.

      [K-T Lim] 14:07:43
These are indications that something has changed in the sky from what it used to be, and so we need to process the images from the telescope within 60 seconds to issue those alerts, so that other

      [K-T Lim] 14:07:57
telescopes can then follow them up and observe things that have changed. Our Data Release Production executes approximately once a year, and it reprocesses all images that have been taken to date using a consistent set of algorithms and configurations, and so

      [K-T Lim] 14:08:16
      that's obviously a data set that's growing each time, and the complexity of the analysis is likely to grow each time as well.

      [K-T Lim] 14:08:24
So that needs to go faster and faster as we progress, because we want to issue one data release each year. And finally, definitely not least, we have the Rubin Science Platform, which provides access to the data

      [K-T Lim] 14:08:42
products and services, for all science users and project staff to do analysis and reprocessing of the data that has been taken. Not shown on this slide, but also important, is our internal staff:

      [K-T Lim] 14:08:56
developers need to do both ad hoc and production-style processing as well. So that's another sink of compute and storage.

      [K-T Lim] 14:09:05
So the kind of architecture that we have to actually perform this is a data management system that looks like this.

      [K-T Lim] 14:09:15
      Here we have the telescope as kind of an input device off on the left hand side.

      [Enrico Fermi Institute] 14:09:18
      Okay.

      [K-T Lim] 14:09:19
      My colleagues who are working on actually building the thing that's pictured behind me would argue that they're doing a lot of the work.

      [K-T Lim] 14:09:27
      But we think that most of it is in the data management system over here.

      [K-T Lim] 14:09:30
      so we grab stuff at the summit on the left hand side of the Us. Data facility.

      [K-T Lim] 14:09:35
      We have the prom processing chain that's on running in near real time. And issuing alerts.

      [K-T Lim] 14:09:41
      After this hands community in the middle of the right hand side of this diagram we have offline processing that is, executing in sort of batch mode.

      [K-T Lim] 14:09:51
      It's high throughput computing, not high performance computing.

      [K-T Lim] 14:09:54
      and it's running across multiple sites. We have partners, and in France, at Cci and and in the Uk who will be executing large portions of the data release production.

      [K-T Lim] 14:10:06
      And and then finally at the bottom, and in the upper right.

      [K-T Lim] 14:10:10
      we have dedicated resources for this science user access and analysis on the Ruben Science platform.

      [K-T Lim] 14:10:18
      I'll talk about that more later

      [K-T Lim] 14:10:23
      We did a number of proof of concept engagements to try to determine how the cloud could work with us, and with this architecture.

      [K-T Lim] 14:10:32
So we did 3 different engagements with 2 separate cloud vendors, and they're documented in a bunch of data management technical notes, which are all linked from this page. The first one in each series is the goals of the engagement,

      [K-T Lim] 14:10:49
what we set out to do, and then we have a report of what we actually managed to accomplish.

      [K-T Lim] 14:10:54
So the first engagement we mostly leveraged to get sort of cloud-native experience, and how to deploy services

      [K-T Lim] 14:11:05
and systems in modern technologies, to improve our deployment models, to get things containerized, etc.,

      [K-T Lim] 14:11:14
      and not just have them running as shell scripts or things that an individual developer ran.

      [K-T Lim] 14:11:20
We learned about potential bottlenecks on high bandwidth-delay product networks.

      [K-T Lim] 14:11:25
Obviously, we're transmitting data from Chile to the US;

      [K-T Lim] 14:11:29
      that's over 200 ms, and it's a 100 gigabit network.

      [K-T Lim] 14:11:35
So very high bandwidth, and we need to make that work efficiently.

      [K-T Lim] 14:11:41
      And so there were a number of bottlenecks.

      [K-T Lim] 14:11:42
there that we worked through. And we learned about how to interact with the vendors, what mechanisms and ways of working with them

      [K-T Lim] 14:11:53
      worked well for us and for them. The second engagement was with a different vendor.

      [K-T Lim] 14:12:00
We tested workflow-execution middleware. This is some of our custom

      [K-T Lim] 14:12:05
middleware, at a modest scale, up to about 1,200 virtual CPUs, and we were able to make use of spot or preemptible instances to run a lot of our processing. It's easy to retry a particular quantum of processing if it failed for some reason, if

      [K-T Lim] 14:12:23
the processor went away, and that reduced costs by a considerable amount

      [K-T Lim] 14:12:29
when you allow preemption that way (see the sketch below). And in the third engagement we tested improved workflow-execution middleware.

      [Enrico Fermi Institute] 14:12:36
      Yeah.

      [K-T Lim] 14:12:37
      So, actually at a similar scale, up to 1,600 vCPUs.

      [K-T Lim] 14:12:45
      And here we also did some transfers over the long-haul network again, and learned about the desirability of having persistent HTTP connections for uploading to object stores in particular.
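
      (A sketch of why persistent connections matter on a ~200 ms path; the signed-URL-style endpoint is hypothetical, and a real object store would require authentication.)

      ```python
      import requests

      # One Session reuses the underlying TCP/TLS connection, so each PUT
      # avoids paying several connection-setup round trips at ~200 ms each.
      session = requests.Session()

      def upload_object(bucket_url, name, data):
          response = session.put(f"{bucket_url}/{name}", data=data, timeout=60)
          response.raise_for_status()

      for i in range(1000):
          upload_object("https://storage.example.com/bucket", f"chunk-{i}", b"payload")
      ```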

      [K-T Lim] 14:13:07
      So all of these taught us something about working with the cloud. And one of the things that we learned, that people don't necessarily talk about a lot: we were able to work with a vendor who had relatively low bureaucracy, high flexibility, and a willingness to assist. You know, a well-defined point of

      [K-T Lim] 14:13:26
      contact and rapid internal processes made things work much more smoothly.

      [K-T Lim] 14:13:33
      As we went through these engagements, and through subsequent work with these vendors, deep engagement with the vendors' engineering teams, being able to talk to the actual product managers and even in some cases the engineers who are working on these products, also was useful. And something that turned

      [K-T Lim] 14:13:53
      out to be quite unexpected is that consultants can also be very useful.

      [K-T Lim] 14:13:59
      So there are a number of consultants who are fully trained and certified for building things on these vendors'

      [K-T Lim] 14:14:08
      clouds. They don't know any more than, you know, the

      [K-T Lim] 14:14:14
      people at the vendors necessarily, but unlike the vendor engineers, they are allowed to work on your code. The vendors can't work on your code; that would cause

      [K-T Lim] 14:14:26
      all kinds of problems, especially since in our case all of our code is open source.

      [K-T Lim] 14:14:31
      But the consultants can. They can actually modify things and update your own code to work better in the cloud.

      [K-T Lim] 14:14:40
      And so that was something that turned out to be very interesting.

      [K-T Lim] 14:14:47
      We did a lot of cost modeling. We already had very complex internal spreadsheets to understand what our data sizing and compute sizing requirements would be.

      [K-T Lim] 14:15:02
      We adjusted them somewhat to fit the cloud storage models and how things would work there, and our vendors also produced spreadsheets that then matched

      [K-T Lim] 14:15:17
      those needs to the available technologies and their quoted prices for them.

      [K-T Lim] 14:15:24
      So in our case, our compute costs, compared with high-energy

      [K-T Lim] 14:15:29
      physics, are not that large; we're talking only something in the millions of core hours, and that's only in year 10 of the survey,

      [K-T Lim] 14:15:37
      when we're doing the maximum amount of processing of the entire survey contents.

      [K-T Lim] 14:15:42
      So that's quite reasonable. The storage costs, on the other hand, for frequently accessed data, turned out to be a major problem.

      [K-T Lim] 14:15:50
      We are expecting to have hundreds of petabytes of results, both sort of in-process results that are being developed

      [K-T Lim] 14:16:01
      for the next data release, as well as the results that are part of the previous data releases that are already public.

      [K-T Lim] 14:16:08
      So those storage costs can be very large. And we have had a number of, you know,

      [Enrico Fermi Institute] 14:16:13
      Okay.

      [K-T Lim] 14:16:19
      kind of debates about why the on-prem storage costs seem to be less than the in-cloud storage costs.

      [K-T Lim] 14:16:26
      I mean, in some cases it's because the total cost of ownership is accounted somewhat differently:

      [K-T Lim] 14:16:31
      sometimes things like people, like administrators, can be charged to different accounts, and they don't actually fall under the project's

      [K-T Lim] 14:16:37
      budget. But I think a lot of it is also that in the cloud you're paying for more durability and performance

      [K-T Lim] 14:16:44
      than we often need in science, right? In a sense, we often have replicas of the data in other places, so we don't need, you know, eight nines or something like that worth of durability in one place. And also we can schedule when we're going to access

      [K-T Lim] 14:17:02
      data, often, and so we don't need the kind of performance that you might need for commercial workloads.

      [K-T Lim] 14:17:09
      Egress costs are often a problem, but there are mitigations, and of course, if you can keep most data transfers either inbound to the cloud or entirely within

      [K-T Lim] 14:17:21
      the cloud, then there are no egress costs, and so that helps a lot.

      [K-T Lim] 14:17:26
      If we manage to do most of the data summarization and visualization within the cloud, and then only have the results

      [K-T Lim] 14:17:33
      exit, that also limits the egress quite a bit.

      [K-T Lim] 14:17:38
      The vendors tend to give credits for egress based on the total amount of spending that you're doing on all the other services that you're buying from them.

      [K-T Lim] 14:17:47
      And so those credits can also help minimize the egress costs.
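
      (A back-of-envelope sketch of the trade-off just described; every number here is an illustrative placeholder, not a quoted price.)

      ```python
      STORAGE_USD_PER_GB = 0.02      # frequently accessed object store (assumed)
      EGRESS_USD_PER_GB = 0.08       # public-internet egress (assumed)
      EGRESS_CREDIT_FRACTION = 0.15  # credit as a share of other spend (assumed)

      def monthly_cost(stored_gb, egress_gb, compute_usd):
          storage = stored_gb * STORAGE_USD_PER_GB
          egress = egress_gb * EGRESS_USD_PER_GB
          credit = min(egress, EGRESS_CREDIT_FRACTION * (storage + compute_usd))
          return storage + compute_usd + egress - credit

      # Summarizing in the cloud and exporting only results shrinks egress_gb:
      print(monthly_cost(stored_gb=5e6, egress_gb=1e5, compute_usd=20_000))
      print(monthly_cost(stored_gb=5e6, egress_gb=2e3, compute_usd=25_000))
      ```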

      [K-T Lim] 14:17:51
      And finally, we did look at, but have not yet moved on, getting a dedicated interconnect.

      [K-T Lim] 14:17:58
      With a dedicated interconnect, you're not using the public Internet or the public egresses, and as a result there can be substantial discounts on the egress costs, because it's kind of paid for in a lump sum rather than on a per-

      [K-T Lim] 14:18:14
      byte or per-gigabyte basis. So the final decision that we made was to have a hybrid model.

      [K-T Lim] 14:18:21
      So we have most of the storage and the large-scale compute,

      [K-T Lim] 14:18:25
      and I'll explain why in a second, on prem at the US

      [K-T Lim] 14:18:29
      Data Facility, which is located at SLAC National Accelerator Laboratory.

      [K-T Lim] 14:18:33
      The users, however, will be supported in the cloud, on a Cloud

      [K-T Lim] 14:18:37
      Data Facility that is actually vendor-agnostic, but we're anticipating that it will be on the Google Cloud Platform for various reasons.

      [K-T Lim] 14:18:47
      And so it looks something like this: we have again the telescope sending data to the US

      [K-T Lim] 14:18:52
      Data Facility, and the data release processing and prompt processing both occur there, with the main archive storage.

      [K-T Lim] 14:19:00
      Then in the Cloud Data Facility, we have the Rubin Science Platform services.

      [K-T Lim] 14:19:05
      We have a cache of data, both for relatively small data sets that we can copy in their entirety and for partial storage of other data sets that are being used frequently, and per-user storage would also be

      [K-T Lim] 14:19:37
      stored entirely in the cloud. This shows that user batch would be executed at the US Data Facility so that it could run against the archive storage.

      [K-T Lim] 14:19:37
      And I'll talk about that, and where those dividing lines might be, in a bit.

      [K-T Lim] 14:19:42
      So, the Rubin Science Platform: what is it, really? Again,

      [K-T Lim] 14:19:48
      it's for our science users who are coming to use dedicated resources that are provided by the project to access our large data sets and use web-based applications on them.

      [K-T Lim] 14:20:01
      So there's a portal, which provides access and visualization and sort of structured expeditions through the data set, with query generation tools as well as lots of visualization, including joint visualization of images and

      [K-T Lim] 14:20:23
      catalogs. We then have Jupyter notebooks;

      [K-T Lim] 14:20:29
      that's actually quite common now; it was not that common

      [K-T Lim] 14:20:31
      a few years ago, when we were starting out on this vision.

      [K-T Lim] 14:20:37
      But that's for more ad hoc analysis by users. And then we have web APIs.

      [K-T Lim] 14:20:43
      These are web services with interfaces defined by the International Virtual Observatory Alliance for astronomy, which provide access to images, both raw and processed, as well as catalogs of things seen on those images. And so that

      [K-T Lim] 14:21:04
      provides, excuse me, both remote access and a little bit of processing, so that we can do things like

      [K-T Lim] 14:21:11
      cutting out sections of images or pasting together images.

      [K-T Lim] 14:21:15
      So this is the user experience the users will have. And behind those 3 major aspects there are the data releases, an alert filtering service, user databases, user files, all kinds of other infrastructure that's necessary to make that work.

      [K-T Lim] 14:21:33
      So, our uses of cloud services: obviously the primary one is going to be the Rubin Science Platform.

      [K-T Lim] 14:21:41
      The reasons for putting this in the cloud include these.

      [K-T Lim] 14:21:46
      So there's security: by putting this in the cloud,

      [K-T Lim] 14:21:48
      we can use separately managed identities that have nothing to do with the identities at our on-prem facilities at SLAC,

      [K-T Lim] 14:21:59
      so all of our users do not need to get SLAC accounts.

      [K-T Lim] 14:22:03
      This is very important, because the Department of Energy has a lot of restrictions, and it's not necessarily very rapid at generating accounts at labs.

      [K-T Lim] 14:22:14
      So being able to maintain our own accounts makes things much more efficient, and allows us to integrate with things like federations that we couldn't otherwise necessarily do.

      [K-T Lim] 14:22:26
      It also means we have a good story for cybersecurity at the lab, because we have relatively limited interfaces with the on-prem facilities.

      [K-T Lim] 14:22:34
      There are certain services that can be queried from the cloud, and those can be listed and tracked and understood.

      [K-T Lim] 14:22:46
      A huge benefit is elasticity, right? So, especially after we have an annual data release, we're expecting that hordes of astronomers

      [K-T Lim] 14:22:55
      will descend on us and want to look at what's new in that release.

      [K-T Lim] 14:22:59
      This might also happen, for example, around key conferences, when people are trying to do work. So in the cloud we have essentially infinite elasticity.

      [K-T Lim] 14:23:07
      We can scale up the Rubin Science Platform by deploying more notebook servers, more API servers, and even more portal servers, arbitrarily, and so we're expected to be able to handle those loads relatively easily. The back-end

      [K-T Lim] 14:23:26
      services could be an issue, but we can also do that in a scalable manner, using object stores, scalable distributed file systems, and a scalable distributed database in the back end.
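
      (A minimal sketch of that elastic scale-out using the Kubernetes Python client; the deployment and namespace names are hypothetical, not the actual RSP configuration.)

      ```python
      from kubernetes import client, config

      config.load_kube_config()
      apps = client.AppsV1Api()

      def scale(deployment, namespace, replicas):
          # Patch only the replica count; Kubernetes rolls out the extra pods.
          apps.patch_namespaced_deployment_scale(
              name=deployment,
              namespace=namespace,
              body={"spec": {"replicas": replicas}},
          )

      scale("notebook-servers", "rsp", 40)  # pre-scale ahead of a data release
      ```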

      [K-T Lim] 14:23:47
      A potential advantage we are looking to prove, but haven't quite yet,

      [K-T Lim] 14:23:51
      is that you could bring your own resources in the cloud.

      [K-T Lim] 14:23:56
      So if a science user had a grant or some other means of obtaining cloud resources on the same cloud, the cloud vendor that we're using, they can federate those resources with the ones that are already present for the Rubin Science Platform and essentially expand their

      [K-T Lim] 14:24:18
      capabilities. Compared with trying to actually purchase hardware at SLAC, or send computers or something like that, this is much,

      [K-T Lim] 14:24:29
      much easier. And so it gives people the ability to use all the same facilities, software, and user interfaces that they're familiar with, at a larger scale, just by adding on to what's present. And finally, the cloud can also provide access to new technologies, things like

      [K-T Lim] 14:24:52
      GPUs, TPUs, or software technologies like sophisticated infrastructure services that are harder to deploy at a lab on premises.

      [K-T Lim] 14:25:03
      And again, you don't need to buy them and keep them working;

      [K-T Lim] 14:25:10
      you can rent them when you need them, and then throw them away.

      [K-T Lim] 14:25:15
      So for the large-scale compute, we have executed a fairly large production.

      [K-T Lim] 14:25:25
      This is our Data Preview 0.2, which is only for 5 years and only for a small portion of the sky, not the full data release

      [K-T Lim] 14:25:35
      production. But we were able to actually run on larger numbers of nodes, 4,000 virtual CPUs; again,

      [K-T Lim] 14:25:42
      not that much compared to high-energy physics necessarily, but pretty large for what we're doing. But we're not expecting to execute the main survey data release production

      [K-T Lim] 14:25:51
      on this; the cost of storing or egressing the large processed products is too excessive to do that.

      [K-T Lim] 14:25:58
      We might be able to do user batch in the cloud,

      [K-T Lim] 14:26:00
      but it'll have some of the same drawbacks, in that we're expecting user batch jobs to also want to process large fractions of the available data.

      [K-T Lim] 14:26:11
      And so transmitting all of those into the cloud, or even storing them temporarily, can have some difficulties.

      [K-T Lim] 14:26:20
      But if we were able to do it, if we can get the caching and the sort of automated transfers working well, then there would be, again, the security and technology kinds of advantages that we would not have on premises. Right now

      [K-T Lim] 14:26:43
      we're going to require that the users who want to execute those large-scale

      [K-T Lim] 14:26:49
      batch jobs get SLAC accounts, and that may eventually become a problem.

      [K-T Lim] 14:26:55
      We've found the cloud to be extremely useful for development and testing.

      [K-T Lim] 14:26:58
      Again, the elasticity, being able to scale up at will, and the technology advantages, being able to use new machines, large amounts of flash storage, for example, things like that that are not easily purchased in an on-premise model, especially now with supply chain issues,

      [K-T Lim] 14:27:21
      have been very helpful for development. And we've also been able to do things like rapid prototyping with advanced services such as serverless, all kinds of deployments.

      [K-T Lim] 14:27:31
      There is one possible future use: I mentioned we have a distributed, scalable database that runs on premises that will handle and serve a lot of the catalogs that are being generated for the stars and galaxies that we're detecting on these images. That

      [K-T Lim] 14:27:48
      database has been customized for astronomy, and has a lot of advantages.

      [K-T Lim] 14:27:53
      One is that it does spherical geometry, which is kind of difficult in a lot of databases.

      [K-T Lim] 14:28:02
      It does what's called shared scans, where multiple queries that are touching the same tables

      [K-T Lim] 14:28:09
      share I/Os, essentially. And that makes things much more efficient and can provide well-understood maximum query times: while the minimum query times may increase, the maximum query times for certain types of queries can be limited, and so

      [Enrico Fermi Institute] 14:28:31
      Okay.

      [K-T Lim] 14:28:32
      we can guarantee that your query will finish in a certain amount of time.

      [K-T Lim] 14:28:38
      We also have special indexes, especially spatial ones, that allow us to do astronomical types of queries much more efficiently.
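
      (A toy cone search to make the spherical-geometry point concrete; a real engine would use the spatial indexes just described rather than a full scan.)

      ```python
      import math

      def to_unit_vector(ra_deg, dec_deg):
          ra, dec = math.radians(ra_deg), math.radians(dec_deg)
          return (math.cos(dec) * math.cos(ra),
                  math.cos(dec) * math.sin(ra),
                  math.sin(dec))

      def cone_search(catalog, ra_deg, dec_deg, radius_deg):
          # A point lies within the cone when its dot product with the target
          # direction is at least cos(radius): valid spherical geometry,
          # with no flat-sky approximation.
          target = to_unit_vector(ra_deg, dec_deg)
          cos_radius = math.cos(math.radians(radius_deg))
          for row in catalog:
              v = to_unit_vector(row["ra"], row["dec"])
              if sum(a * b for a, b in zip(target, v)) >= cos_radius:
                  yield row
      ```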

      [K-T Lim] 14:28:49
      A lot of these differentiators are kind of going away with cloud deployments. Spherical geometry is becoming more available through GIS kinds of packages.

      [K-T Lim] 14:28:59
      The shared scan is still a win, but when everything is on NVMe flash the number of IOPS is so high that, you know, you can do individual I/Os for each query without actually losing a lot. The special indexes that we have are still

      [K-T Lim] 14:29:18
      a bit of an issue; they are still better in-house than they are in the cloud, and retrofitting them to cloud

      [K-T Lim] 14:29:25
      databases is difficult. And finally, storage costs can still be an issue, again because we can do this cheaper in-house rather than using the cloud storage. And then, finally, using the cloud for archival or tape-replacement storage

      [K-T Lim] 14:29:47
      may be comparable in terms of total cost of ownership.

      [K-T Lim] 14:29:51
      This is something we're still investigating, especially if you don't retrieve the data.

      [K-T Lim] 14:29:56
      If you do have to retrieve the data, then there are large egress costs again, to get it out of the cloud and into your on-premises storage, and so that becomes an issue.

      [K-T Lim] 14:30:07
      But if you're in that kind of disaster situation, it may not be that bad.

      [K-T Lim] 14:30:15
      One other aspect of the cloud that has been kind of important, I guess, is reliability.

      [K-T Lim] 14:30:22
      So while I mentioned that in some cases the durability of storage might be overkill, in other cases,

      [K-T Lim] 14:30:29
      well, we do actually experience higher reliability and higher ability to deliver to our end users by deploying on the cloud than on premises.

      [Enrico Fermi Institute] 14:30:37
      Okay.

      [K-T Lim] 14:30:38
      First of all, one of the maybe-negatives is that we've seen that Kubernetes upgrades will roll through our clusters semi-arbitrarily. There are some controls that you can put on them,

      [K-T Lim] 14:30:50
      but the vendors kind of want to update it when they want to update it.

      [K-T Lim] 14:30:55
      So we need to make sure we've designed services to deal with these kinds of rolling outages.
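
      (One common way to tolerate such rolling upgrades, sketched with the Kubernetes Python client; the names are hypothetical, and this is not the project's actual configuration.)

      ```python
      from kubernetes import client, config

      config.load_kube_config()
      policy = client.PolicyV1Api()

      # A PodDisruptionBudget tells Kubernetes to keep a minimum number of
      # replicas running while nodes are drained for an upgrade.
      pdb = client.V1PodDisruptionBudget(
          metadata=client.V1ObjectMeta(name="portal-pdb"),
          spec=client.V1PodDisruptionBudgetSpec(
              min_available=2,  # never evict below two running portal pods
              selector=client.V1LabelSelector(match_labels={"app": "portal"}),
          ),
      )
      policy.create_namespaced_pod_disruption_budget(namespace="rsp", body=pdb)
      ```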

      [K-T Lim] 14:31:01
      Not all of them are yet, but we will adjust them over time. Again,

      [K-T Lim] 14:31:06
      the durability of storage is extremely high, maybe more than necessary. Service outages are quite rare and usually short compared with some of the outages that we've had on prem, and 24/7 support for basic infrastructure, and even for higher-level services, is often better than we have

      [K-T Lim] 14:31:24
      on prem, where it may just be 8/5, essentially. So while sometimes the reliability in the cloud is more than you need, and so you're paying for more than you actually need,

      [Enrico Fermi Institute] 14:31:33
      And

      [K-T Lim] 14:31:41
      in other cases it can actually be a benefit. So, we're trying to wrap up here:

      [K-T Lim] 14:31:47
      conclusion, status, and plans. The hybrid model seems to be suitable for our use

      [K-T Lim] 14:31:52
      cases. We are practicing today with an Interim Data Facility on the Google Cloud Platform, which hosts simulated data until the telescope is built.

      [K-T Lim] 14:32:04
      We're working with that to give scientists a chance to work with data that looks like the real thing,

      [K-T Lim] 14:32:09
      using all the tools that they will eventually have.

      [K-T Lim] 14:32:13
      We're building out our back-end on-prem infrastructure to practice the integration with the cloud and tune the various caching parameters

      [K-T Lim] 14:32:23
      about what gets sent to the cloud when, and we are obviously continuing to track developments in cloud services and pricing.

      [K-T Lim] 14:32:32
      And I'm happy to answer any questions.

      [Enrico Fermi Institute] 14:32:41
      Okay. So we have a couple of hands raised. Lindsey,

      [Enrico Fermi Institute] 14:32:45
      why don't you go first?

      [Lindsey Gray] 14:32:48
      Yeah, sure. Actually, just a quick operational one about the fact that they're rolling through Kubernetes upgrades kind of at

      [Lindsey Gray] 14:32:57
      their own whim. In particular, since Kubernetes keeps updating the spec and the interface that you're talking to, how much maintenance burden have you found that to be as the spec makes backwards-incompatible changes?

      [K-T Lim] 14:33:16
      The upgrades of Kubernetes itself have typically not been too much of a problem.

      [Enrico Fermi Institute] 14:33:17
      See.

      [K-T Lim] 14:33:26
      Well, we run our development clusters on sort of the latest, more bleeding-edge versions, and our production clusters on the more stable versions.

      [K-T Lim] 14:33:41
      So we've typically seen any problems already, either in the development clusters or even at the summit, where we're probably a little bit more rapid to update than even on the stable channels in the cloud.

      [K-T Lim] 14:33:55
      So we've been really prepared for any of these things that happen.

      [Lindsey Gray] 14:33:58
      Okay. So there's nothing really nasty about the cadence of updates or upgrades on the cloud side of things,

      [Lindsey Gray] 14:34:05
      and you feel like you have control over the situation, by and large. Alright, cool.

      [Lindsey Gray] 14:34:10
      Thank you.

      [K-T Lim] 14:34:10
      Adequately, yes. You do have to have people that are dedicated to keeping this up to date and porting things. I know that there are some people in science who like the idea of, well, we're going to install our service on the machine and then sort of wrap the

      [K-T Lim] 14:34:26
      whole thing in amber and just kind of leave it there and have it run.

      [K-T Lim] 14:34:30
      And the model cannot be that way. You have to deal with OS updates and service updates, and be on top of them.

      [Lindsey Gray] 14:34:33
      Right.

      [Lindsey Gray] 14:34:40
      Okay, Cool: Thank you.

      [Enrico Fermi Institute] 14:34:44
      Okay, Tony.

      [Tony Wong] 14:34:46
      Yeah, oh, hi! So I've got basically, you know, a couple of questions rolled up into one with respect to storage, because you kept mentioning, you know, concerns about the cost of egress and storage. So, you know, when I looked at Google and Amazon, I noticed that they

      [Tony Wong] 14:35:05
      have many, many different levels of storage reliability, you know, responsiveness, backups, 24/7 availability, and so forth.

      [Tony Wong] 14:35:16
      Did the VRO do a study to determine at which point, in terms of the levels of storage services that it needs, the cloud would look more favorable cost-wise compared to on-premise? And then, if you look at the level

      [Enrico Fermi Institute] 14:35:16
      Okay.

      [Enrico Fermi Institute] 14:35:22
      Yeah.

      [Tony Wong] 14:35:40
      of service, then the opposite side of the question is also: did you do a study to optimize costs, given that most of the storage is going to be on prem but some of the storage is going to be on the cloud?

      [Tony Wong] 14:35:56
      So at what point is there a tipping point where it pays to have cloud

      [Tony Wong] 14:36:03
      storage services? You know, how much does it take? Is it 10% of your data,

      [Tony Wong] 14:36:08
      20% of your data? At what point does it really pay to go on the cloud?

      [K-T Lim] 14:36:14
      Okay, so, yeah, a couple of things. First of all, just something that our communications people make me say:

      [K-T Lim] 14:36:23
      We try not to use abbreviations for the name of the observatory.

      [K-T Lim] 14:36:26
      We prefer to just call it the Vera Rubin Observatory rather than VRO. Second of all, yeah, we did extensive modeling of how frequently our data is going to be accessed, because there are obviously different levels

      [K-T Lim] 14:36:49
      of access, and different prices for those, ranging all the way from sort of standard object store (actually, even above that, there are POSIX file systems, which tend to be quite expensive), that is, object store for frequently accessed data, where you get charged

      [K-T Lim] 14:37:10
      a tiny bit per operation, but not so much,

      [K-T Lim] 14:37:16
      and then there's a large rental cost for bytes per month; all the way down to archival

      [K-T Lim] 14:37:25
      cold storage, where you get charged a lot for retrieval but considerably less for the actual storage per month. And so we looked at, for each of our data sets, how many accesses we would be expecting to have, and as a result,

      [K-T Lim] 14:37:44
      what category of storage we could use for them. Unfortunately, for a lot of our data sets,

      [K-T Lim] 14:37:51
      we don't really know. We know what kinds of access patterns we have for our own processing, for the data releases or for the prompt processing. But the science users might do anything, and they can look at any data

      [K-T Lim] 14:38:11
      anytime, and it's hard to say which pieces of data will be accessed more than others.

      [K-T Lim] 14:38:18
      So a lot of that ends up being attributed to the most expensive object store, as a result.
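
      (A sketch of the per-dataset modeling described: pick the cheapest storage class given expected reads per month. Class names and prices are placeholders, not quotes.)

      ```python
      CLASSES = {
          # name: (USD per GB-month stored, USD per GB retrieved) -- assumed
          "standard": (0.020, 0.00),
          "nearline": (0.010, 0.01),
          "archive": (0.0012, 0.05),
      }

      def best_class(size_gb, gb_read_per_month):
          def monthly_cost(name):
              store, retrieve = CLASSES[name]
              return size_gb * store + gb_read_per_month * retrieve
          return min(CLASSES, key=monthly_cost)

      # A data set whose access pattern is unknown gets modeled as hot:
      print(best_class(size_gb=1e6, gb_read_per_month=2e6))  # -> "standard"
      ```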

      [K-T Lim] 14:38:27
      And again, as I mentioned, some of the total cost of ownership that the cloud vendors are charging may not be fully charged to the project when it's on premises, and as a result our

      [K-T Lim] 14:38:59
      on-premise costs are considerably lower for those kinds of things. We have those numbers in terms of a comparison of on-prem and in-cloud; they will differ depending on your institution

      [K-T Lim] 14:38:59
      and the cost of hardware. I think we've been getting good deals from the hardware vendors, when they can actually deliver.

      [K-T Lim] 14:39:09
      So we have, I mean, we know at what dollar-per-month cost

      [K-T Lim] 14:39:19
      it would make sense to switch, and we're not anywhere in the ballpark right now.

      [K-T Lim] 14:39:24
      But, as I said, for the archival storage, when you compare it with tapes and tape robots and tape drives and things like that, it may have crossed over in terms of being cheaper in the long run to store it in the cloud than it

      [K-T Lim] 14:39:52
      is to store it on prem. But some of that depends on what you assume in terms of how often you're going to retrieve it.

      [K-T Lim] 14:39:52
      Originally we were going to write all the raw data to tape immediately, and then actually reread it every year to do the reprocessing, which would both guarantee that it was actually readable as well as make the costs of

      [K-T Lim] 14:40:14
      storage of that raw data lower.

      [K-T Lim] 14:40:17
      But it turns out that, as time goes on, it's not too bad to actually store the raw data spinning.

      [K-T Lim] 14:40:26
      There are other reasons to store it spinning, and so we will not be doing that.

      [K-T Lim] 14:40:30
      So the actual number of retrievals from tape is hoped to be near zero.

      [K-T Lim] 14:40:41
      Does that answer enough of your question? I'm sorry.

      [Tony Wong] 14:40:42
      Yes, it does. I think it is very informative. Thank you.

      [Enrico Fermi Institute] 14:40:47
      Right. Dirk, you want to jump in?

      [Dirk Hufnagel] 14:40:50
      Yeah, I had two related questions, and they're both on cloud cost. What I was curious about is how you budget for what you spend on cloud throughout the year,

      [Dirk Hufnagel] 14:41:06
      whether you tell yourself, like, for this year

      [Dirk Hufnagel] 14:41:09
      we have this budget for cloud, or whether you're a bit more flexible, where you allocate funds for on-premise or cloud throughout the year. And then, related, independently of how you set that target: how do you actually control your cost, especially in light of still keeping the ability to support these elastic

      [Dirk Hufnagel] 14:41:26
      use cases? Because, I mean, at some point, if your available money goes to zero, you can't really be fully elastic anymore.

      [Dirk Hufnagel] 14:41:32
      So

      [K-T Lim] 14:41:33
      Yes, yeah. The way that our budgeting works, we have separated out the cloud and the on-prem costs,

      [K-T Lim] 14:41:44
      so they are not one. I mean, they are stemming from an original budget,

      [K-T Lim] 14:41:51
      but that budget has been divided up relatively early. And actually, one of the ways of getting a discount from the cloud provider,

      [K-T Lim] 14:42:03
      in this case Google Cloud Platform, was to provide a commitment that we would spend a certain amount.

      [K-T Lim] 14:42:08
      So we've already kind of pre-budgeted

      [K-T Lim] 14:42:12
      that amount for a number of years, in order to be able to get substantial discounts. We don't expect that to be a problem.

      [K-T Lim] 14:42:24
      One of the nice things is that it is just a fungible dollar amount, and we can spend it on any services.

      [K-T Lim] 14:42:32
      So if we decide we don't want what we originally wanted,

      [K-T Lim] 14:42:36
      we want to change it to something else, that's no problem;

      [K-T Lim] 14:42:40
      it's all the same dollars. In terms of the elasticity and budgeting,

      [K-T Lim] 14:42:48
      it is true that we do have to put in place quotas and throttles, so that our users can't just all chew up the entire budget in the first week. And so there need to be controls like that imposed in

      [K-T Lim] 14:43:05
      the services.
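
      (A minimal sketch of such a per-user throttle; the quota figure and the in-memory bookkeeping are assumptions for illustration, and a real service would persist spend records.)

      ```python
      MONTHLY_QUOTA_USD = 50.0  # illustrative cap, not a real figure
      spend = {}  # user -> dollars consumed this month

      def authorize(user, estimated_cost_usd):
          # Reject the request before it runs if it would blow the user's quota.
          used = spend.get(user, 0.0)
          if used + estimated_cost_usd > MONTHLY_QUOTA_USD:
              raise PermissionError(f"{user} would exceed the monthly quota")
          spend[user] = used + estimated_cost_usd
      ```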

      [Enrico Fermi Institute] 14:43:10
      Okay.

      [K-T Lim] 14:43:11
      But that would apply on prem as well; it's also not infinitely adjustable.

      [Dirk Hufnagel] 14:43:13
      But on premise, at least, it's limited:

      [Dirk Hufnagel] 14:43:17
      you can't run more than what the deployed hardware allows you to. Do

      [Dirk Hufnagel] 14:43:22
      you have enough data to say how well your controls work?

      [K-T Lim] 14:43:29
      So far, again, because most of the access has been by friendly users and staff,

      [K-T Lim] 14:43:35
      we have not had any problems. Our budget is large enough to cover all the uses that people have been making of it.

      [Dirk Hufnagel] 14:43:38
      Okay.

      [K-T Lim] 14:43:46
      But we are tracking, on a weekly basis,

      [K-T Lim] 14:43:51
      how much we have been spending, and if anything looks like it is getting out of sync, then we can investigate and try to control that.
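
      (A sketch of that weekly tracking against a cloud billing export; GCP can export billing data to BigQuery, but the table name and alert threshold here are assumptions.)

      ```python
      from google.cloud import bigquery

      client = bigquery.Client()
      query = """
          SELECT service.description AS service, SUM(cost) AS usd
          FROM `my-project.billing.gcp_billing_export_v1`
          WHERE usage_start_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
          GROUP BY service
          ORDER BY usd DESC
      """
      for row in client.query(query).result():
          if row.usd > 1000:  # illustrative threshold for "out of sync"
              print(f"check spend: {row.service} ${row.usd:.2f} this week")
      ```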

      [K-T Lim] 14:44:10
      In terms of actually implementing the throttles in the user-facing services, we only have a couple of them in place right now, and again,

      [K-T Lim] 14:44:15
      they have not been triggered, so it's hard to say.

      [Enrico Fermi Institute] 14:44:18
      Yeah.

      [Dirk Hufnagel] 14:44:18
      Okay, thanks.

      [Enrico Fermi Institute] 14:44:22
      Okay. Doug?

      [Douglas Benjamin] 14:44:24
      Yeah, thanks for the nice talk and the details. Alright, on slide 10

      [Douglas Benjamin] 14:44:31
      you sort of list the reasons for the use of the cloud services, and also the advantages. It said security,

      [Douglas Benjamin] 14:44:41
      but that really meant authorization, right? In the sense that you have 2 different user communities that get access, and

      [K-T Lim] 14:44:41
      Yeah.

      [Douglas Benjamin] 14:44:52
      DOE site resources require a different level of authorization, even within the host lab itself, and what is currently available won't necessarily fit it.

      [Douglas Benjamin] 14:45:06
      Does the move toward federated identity across the US

      [Douglas Benjamin] 14:45:13
      DOE complex change that calculation at all for you?

      [K-T Lim] 14:45:19
      So there are 2 things there.

      [Douglas Benjamin] 14:45:23
      Because I can't imagine the cloud is more secure,

      [Douglas Benjamin] 14:45:24
      given all the hacks that have happened with cloud vendors and the services on them,

      [Enrico Fermi Institute] 14:45:29
      Okay.

      [Douglas Benjamin] 14:45:32
      than you imply with your statement.

      [K-T Lim] 14:45:35
      Yeah, yes and no. I mean, so, first of all,

      [K-T Lim] 14:45:41
      Okay, let me say that at least the the cloud vendors are, I mean.

      [K-T Lim] 14:45:56
      well, I think they have the ability to be highly secure. Whether their delivered

      [K-T Lim] 14:46:02
      commercial-level services are highly secure or not is another question. But I would, I mean, even though our lab has a great cybersecurity team, I would still say that the cloud vendors are devoting more resources to security than we are,

      [K-T Lim] 14:46:18
      and so it would seem that it would be at least comparable.

      [K-T Lim] 14:46:23
      That's sort of on an overall holistic level.

      [K-T Lim] 14:46:26
      In terms of the authorizations: yes. I mean, one of the things that we're doing is that the Rubin data sets are open to all astronomers within the US and Chile, as well as selected named astronomers from international

      [K-T Lim] 14:46:52
      partner institutions. And so, even if we had an authentication federation for the US,

      [K-T Lim] 14:47:04
      which at this point we are using InCommon for, and planning to use equivalents in Chile and in Europe,

      [K-T Lim] 14:47:12
      the authorization part of it is still sort of unique to the project. And it also may include international partners who have some difficulties with the US

      [K-T Lim] 14:47:32
      government or with US labs.

      [Douglas Benjamin] 14:47:33
      Yeah, I mean, from the countries that are non grata.

      [K-T Lim] 14:47:36
      Yeah, and so those, I mean, sometimes not even countries per se,

      [K-T Lim] 14:47:45
      but citizens of those countries who are working at other institutions. So that can be better.

      [Douglas Benjamin] 14:47:49
      Right.

      [K-T Lim] 14:47:54
      And then the final thing is that in that security bullet, well, I guess I can pull it up again.

      [K-T Lim] 14:48:00
      But

      [K-T Lim] 14:48:04
      Where to go.

      [K-T Lim] 14:48:11
      Here we go. So, I mentioned the limited interfaces. I think that even if we get hacked, and we kind of expect to have our user-facing services hacked eventually, just because a lot of it, even if it's built on vendor infrastructure, is still our code, and

      [K-T Lim] 14:48:33
      we could still have security problems. So even if that's hacked, the interfaces with the on-prem facilities are limited in terms of what they can do and what they can extract, and what they're extracting is public data, or essentially public data

      [K-T Lim] 14:48:51
      anyway; it's restricted to data rights holders. And so the exposure to the lab is considerably smaller than if people were actually logging into machines at the lab and had access to the networks in a more general way, etc. Does that make sense?

      [Douglas Benjamin] 14:49:07
      It all makes sense. I realize that one difference is that the science community in Rubin is a little bit different than the LHC experiments, because it's one overarching organization; more people have accounts than don't on the LHC

      [Douglas Benjamin] 14:49:29
      experiments, for example. The last quick question I had on this:

      [K-T Lim] 14:49:31
      Yeah.

      [Douglas Benjamin] 14:49:35
      you also implied that storage costs are one of those things that you're worried about eating the budget.

      [Douglas Benjamin] 14:49:40
      So that's why you limited it; you essentially have a lot of the storage,

      [Douglas Benjamin] 14:49:47
      most of it, on prem.

      [Douglas Benjamin] 14:49:51
      The size of the storage in the cloud seems to be limited relative to what you have on prem.

      [K-T Lim] 14:49:58
      Right.

      [Douglas Benjamin] 14:50:00
      And is that really, did I get the right sense, that it is sort of cost containment for storage?

      [Enrico Fermi Institute] 14:50:01
      Okay.

      [K-T Lim] 14:50:07
      Yes. I mean, basically, we have hundreds to even thousands of petabytes of data that we are expecting to store by the end of the survey, and the cost of storing all of that for months and months and months in the cloud would be excessive compared to our

      [K-T Lim] 14:50:30
      budget. So we are storing on the order of, you know, single-digit percents of that in the cloud, with the rest of it on prem, in order to fit within the available budget.

      [Douglas Benjamin] 14:50:50
      Thank you very much.

      [Fernando Harald Barreiro Megino] 14:50:56
      Yeah, a simple question: why did you choose Google over the other cloud

      [Fernando Harald Barreiro Megino] 14:51:04
      vendors? Was it just the best deal, or anything else?

      [K-T Lim] 14:51:10
      Oh, there are a number of reasons. I forget whether we actually have,

      [K-T Lim] 14:51:15
      I don't think we have, a public document that actually states this.

      [K-T Lim] 14:51:23
      Yeah. But I can say, I think, that there were a number of factors. Pricing is only one of them, and, as I mentioned, the ability to work with Google, and their flexibility and ability to work well with our engineering teams, was

      [Enrico Fermi Institute] 14:51:31
      Hmm.

      [K-T Lim] 14:51:58
      actually a fairly major factor as well. The quality of their services,

      [K-T Lim] 14:52:05
      in some cases, too. I mean, a lot of the things that we use,

      [K-T Lim] 14:52:08
      we try to be vendor-agnostic with, and so the interfaces are all the same.

      [K-T Lim] 14:52:15
      But the performance underlying those interfaces can change from vendor to vendor. And I guess it's something that I've actually kind of complained to Google about sometimes, that they often seem to offer a much better product

      [K-T Lim] 14:52:31
      than we need, but at a somewhat more expensive price. Sometimes, though, taking advantage of those improved performance capabilities can actually be useful.

      [Enrico Fermi Institute] 14:52:38
      Thank you.

      [K-T Lim] 14:52:44
      So, for things like, yeah, just as an example, object store retrieval: even for the sort of coldest archive level of storage, the latencies to actually retrieve the data can be almost the same as for normal object store on Google Cloud, whereas with

      [K-T Lim] 14:53:03
      Amazon's Glacier there are much greater latencies, most of which we wouldn't mind, but it can be nice to have that capability to just grab something if you need it every once in a while.

      [K-T Lim] 14:53:15
      So there are a variety of reasons like that.

      [K-T Lim] 14:53:20
      I don't think I can go and say everything in terms of the vendor comparisons at this time, just because I don't think it's all public.

      [Enrico Fermi Institute] 14:53:29
      So, is Rubin using the subscription model with Google?

      [K-T Lim] 14:53:37
      A subscription model in the sense that, where we have,

      [K-T Lim] 14:53:42
      sort of, it's not a prepaid kind of plan, but

      [K-T Lim] 14:53:49
      there is a commitment to spend a certain amount over a certain amount of time, and we get a number of discounts as a result of that.

      [Enrico Fermi Institute] 14:53:59
      So have you done that? Has that negotiation happened like once?

      [Enrico Fermi Institute] 14:54:05
      Or, you know, have you committed to doing something for some amount of time, and then gone back and negotiated a new deal? Have you done anything like that? And if you have, have there been significant deltas in the

      [Enrico Fermi Institute] 14:54:21
      pricing, or anything like that?

      [K-T Lim] 14:54:23
      Right. So we have done that once, for the Interim Data Facility, which was for a period of 3 years, and that time period is kind of coming up soon.

      [K-T Lim] 14:54:35
      We are in the middle of working on a similar kind of purchase plan for the beginning of operations of the survey, which will start in '24. And for that negotiation, I mean, there's some question

      [K-T Lim] 14:55:07
      about whether we can do it as a sole-source agreement; that's being thought about, as well as what the pricing will be.

      [K-T Lim] 14:55:07
      In our discussions so far, we expect that the pricing will be, if anything, lower than what we were getting before, for a number of things.

      [K-T Lim] 14:55:17
      We are cognizant of and worried about the potential for vendor lock-in.

      [K-T Lim] 14:55:23
      We've had cases with unnamed databases where licenses have radically escalated in price, and with other systems where, similarly, there can be unexpected cost increases.

      [K-T Lim] 14:55:38
      We don't necessarily see that happening here, for a number of reasons.

      [K-T Lim] 14:55:41
      Again, we're trying to use commodity interfaces and services that we could get from another vendor if necessary, or even deploy ourselves on premises if we absolutely had to. And they're also ones that are very commonly used commercially; they're not sort of unique

      [K-T Lim] 14:56:02
      to science in any way, and so there are additional pressures to keep the cost down that way.

      [Enrico Fermi Institute] 14:56:07
      Okay, Thank you.

      [Enrico Fermi Institute] 14:56:08
      Okay, thank you. Doug. Did you have another comment?

      [Douglas Benjamin] 14:56:10
      Yeah, it's 2 questions; they're slightly different. One is sort of high-level, on the science platform in the cloud:

      [Douglas Benjamin] 14:56:19
      how long do you envision negotiating your contracts with the vendor

      [Douglas Benjamin] 14:56:25
      for? Is it like a 5-year period, and then you renegotiate after 5?

      [Douglas Benjamin] 14:56:29
      Or is it for the sort of lifetime of the data taking?

      [Douglas Benjamin] 14:56:34
      Because I know the data has to be around for much longer than the actual,

      [Enrico Fermi Institute] 14:56:36
      Hello!

      [Douglas Benjamin] 14:56:40
      the telescope runtime.

      [K-T Lim] 14:56:41
      Right. So there are 2 things there. One is that the survey is scheduled to run for 10 years; we'll be taking data for 10 years. And under the project budget,

      [K-T Lim] 14:56:54
      we are committed to providing data products and services to the science users

      [K-T Lim] 14:57:00
      for 11 years after the start of survey operations,

      [K-T Lim] 14:57:04
      so 10 plus, actually, sorry, no, let me correct that.

      [K-T Lim] 14:57:10
      I think it's been updated to 12 years. So it's 10 years of data taking, one year of processing the 10 years,

      [K-T Lim] 14:57:21
      and then one year of delivering that data release. So that's 12 years.

      [K-T Lim] 14:57:21
      There is a plan for archiving this data

      [K-T Lim] 14:57:28
      and preserving it for indefinite periods

      [K-T Lim] 14:57:35
      after the end of the project. But that plan is not funded by the project;

      [K-T Lim] 14:57:39
      it has to be funded by the NSF separately, and so we don't know exactly how that's going to happen. Of course, it is,

      [K-T Lim] 14:57:44
      you know, 14 years in the future. Now, the negotiation for the purchase of the cloud services will be for a particular term, and it's likely to be something on the order of 3 to 5 years, and then adjusted thereafter. Some of that is on the vendor

      [K-T Lim] 14:58:04
      side, actually, in that they are not sure how much they want to commit to

      [K-T Lim] 14:58:09
      for a long period of time. But I think it also benefits us, in that we have the opportunity to change if necessary.

      [Douglas Benjamin] 14:58:19
      Okay. And then my orthogonal question was: you mentioned

      [Douglas Benjamin] 14:58:24
      bring-your-own for the science platform. Does that mean bring your own,

      [K-T Lim] 14:58:28
      Yeah.

      [Enrico Fermi Institute] 14:58:28
      Yeah.

      [Douglas Benjamin] 14:58:32
      since you said Google, and the science platform will be in Google,

      [Douglas Benjamin] 14:58:35
      does that mean bring your own to Google? Or does that mean, I'm a researcher at a university, and my university has made available

      [Douglas Benjamin] 14:58:45
      some compute resources for me to use, and I want to stitch my university stuff in?

      [K-T Lim] 14:58:52
      Yeah. So stitching the university stuff together with the cloud platform will be a little bit more difficult.

      [K-T Lim] 14:59:02
      We will have interfaces to be able to do sort of bulk

      [K-T Lim] 14:59:05
      downloads of chunks of data that you might want to process locally,

      [K-T Lim] 14:59:10
      but that's expected to be used mostly by certain collaborations that will be working together and using large-scale facilities like NERSC.

      [K-T Lim] 14:59:25
      What I was referring to there specifically was smaller-scale, you know, collaborations or individual investigators who might say, you know,

      [K-T Lim] 14:59:40
      I want to spend $10,000 on a Google account and purchase compute using that, and then federate that with the science platform, and be able to expand their capabilities using all the same tools that they're already using, but just

      [K-T Lim] 14:59:59
      now having increased resources to be able to work with, just by having a separate account.

      [Douglas Benjamin] 15:00:05
      Okay, yeah. And so for the out-of-Google, off-premises stuff, you're really assuming a certain size before it becomes tenable and feasible.

      [Douglas Benjamin] 15:00:20
      What I mean is, there's a certain number of scientists banding together to collaborate, to, you know, provide a certain amount of resources

      [Douglas Benjamin] 15:00:30
      to do some work; then it's worth your time to bring stuff in.

      [Douglas Benjamin] 15:00:34
      If someone shows up with, you know, 5 computers in their data center, are you still going to do that?

      [Douglas Benjamin] 15:00:42
      You know, there must be a threshold.

      [K-T Lim] 15:00:42
      Yeah, so again, if they have 5 computers and they want to download data that fits on those 5 computers

      [K-T Lim] 15:00:52
      and then do whatever they want with it, that's fine,

      [K-T Lim] 15:00:55
      and there's no problem with that. If they have, you know, 10,000 cores, and they want to download 100 petabytes worth of data, then we need to chat with them about how we do that.

      [Douglas Benjamin] 15:01:08
      Okay, thanks.

      [K-T Lim] 15:01:13
      But, I mean, again, the advantage here is really about the flexibility of using any resources that are available in the cloud to work on the same data, because all the data is already there.

      [Douglas Benjamin] 15:01:27
      But it's in the same cloud provider. Okay, thanks.

      [K-T Lim] 15:01:30
      In the same cloud, yes.

      [Enrico Fermi Institute] 15:01:33
      Okay.

      [K-T Lim] 15:01:34
      Yeah, I was looking at,

      [K-T Lim] 15:01:40
      it's very interesting, but I think Snowmobile is only usable to bring data into AWS, as far as I could tell; exporting

      [K-T Lim] 15:01:54
      it is not so easy.

      [Lindsey Gray] 15:01:57
      It's a truck! You should be able to go

      [Lindsey Gray] 15:02:01
      both directions.

      [K-T Lim] 15:02:02
      Yeah, they have an e-ink label on it, though, that automatically changes to point to the local Amazon data center.

      [Enrico Fermi Institute] 15:02:13
      Okay, other questions or comments for KT?

      [Enrico Fermi Institute] 15:02:22
      Okay, if not: KT, thank you very much. This was really informative, and I think everybody learned a lot from this.

      [K-T Lim] 15:02:27
      Thank you. I think there will also be talks on very similar

      [K-T Lim] 15:02:32
      topics at the ATLAS software and computing workshop, I think that's in October, and they're even talking about

      [K-T Lim] 15:02:44
      maybe one of us going to CERN in the spring for similar kinds of conversations.

      [K-T Lim] 15:02:49
      So perhaps we can also talk then. Anyway, thank you very much for inviting me.

      [Enrico Fermi Institute] 15:02:52
      Great. Yeah, Thank you.

      [Enrico Fermi Institute] 15:02:58
      Yeah. So, we have a few more things left in the agenda; let me share the screen again.

      [Enrico Fermi Institute] 15:03:13
      This was the charge question that was posted before we had KT's presentation.

      [Enrico Fermi Institute] 15:03:20
      The next thing was, you know, just kind of, as we wrap up the workshop: were there

      [Enrico Fermi Institute] 15:03:28
      any other topics, any other observations that folks wanted to bring up here?

      [Enrico Fermi Institute] 15:03:35
      If there's anything that we haven't covered that we should have covered, and folks want to bring that up, now is a great time.

      [Paolo Calafiura (he)] 15:03:47
      Maybe I missed it, but did we discuss the next steps?

      [Enrico Fermi Institute] 15:03:51
      Yeah, I have a slide for that, actually.

      [Paolo Calafiura (he)] 15:03:54
      Good.

      [Enrico Fermi Institute] 15:03:56
      Lindsey, you had something?

      [Lindsey Gray] 15:03:57
      Yeah, since Dale's still here, and I forgot to ask this morning:

      [Lindsey Gray] 15:04:03
      going back to what you were talking about yesterday,

      [Lindsey Gray] 15:04:08
      I wanted to understand the process for setting up a layer 2 or layer 3 circuit, and whether that just requires having, you know, an appropriate certificate from an authority and can be largely automated, or does someone actually have to physically provision things

      [Lindsey Gray] 15:04:32
      at some point whenever you're making every single circuit?

      [Lindsey Gray] 15:04:32
      How automated can that whole process be?

      [Dale Carder] 15:04:35
      Yeah, on the ESnet side, for layer 2 it can be fairly automated.

      [Enrico Fermi Institute] 15:04:40
      Okay.

      [Dale Carder] 15:04:41
      I forget offhand how it works, but that's sort of tried and true. For layer 3,

      [Dale Carder] 15:04:46
      that's to be defined. Typically, there are still other constraints

      [Dale Carder] 15:04:53
      people might want to consider, such as: is there adequate capacity on the physical infrastructure on which these services are virtualized?

      [Dale Carder] 15:05:00
      So there still ends up being, like, you know, a period of consultation before it gets to the point of making dynamic

      [Dale Carder] 15:05:08
      API calls.

      [Lindsey Gray] 15:05:10
      Okay, yeah, this will be a bit of a longer thought process.

      [Lindsey Gray] 15:05:15
      But I'm just wondering, yeah, exactly how much of that can really be automated.

      [Lindsey Gray] 15:05:20
      Because even in terms of, like, trying to understand what capacity you have at the site or between 2 sites, that's something that you could also figure out how to negotiate mostly automatically if you had some orchestration service dealing with both sites.

      [Lindsey Gray] 15:05:37
      So, huh. Alright, I'll chew on that some more and probably send you an email.

      [Dale Carder] 15:05:43
      Yeah. Also, I think it's in the Snowmass report, in the network section;

      [Dale Carder] 15:05:47
      I forget, that was like Area 4. There's some more stuff around that.

      [Lindsey Gray] 15:05:48
      Okay.

      [Lindsey Gray] 15:05:54
      Alright great thanks.

      [Enrico Fermi Institute] 15:05:57
      Andy, you had a comment?

      [abh] 15:06:00
      Yeah, there was a slide way back, I mean way back earlier today, about what other services are needed to make cloud access

      [abh] 15:06:09
      more efficient, better, whatever. And I wasn't able to attend that particular section because I had a conflicting meeting,

      [abh] 15:06:18
      so I did want to mention a project we're working on that is called the S

      [abh] 15:06:23
      3 Gateway. Its big thing is to enable transitioning to the cloud easily, by essentially allowing the current HEP security infrastructure, whether it be certificates or tokens, to be used with any cloud provider. The S3 Gateway basically translates all of

      [abh] 15:06:46
      those credentials, applies the security that's necessary, and then does whatever you need to do in terms of upload and download to the cloud storage.

      [abh] 15:06:54
      So that is, in my mind, a great way to bridge the gap before we actually have to deal with the actual cloud provider security,

      [abh] 15:07:07
      so that people can start using this stuff relatively easily with their existing credentials.
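
      (A minimal sketch of the gateway idea as described, not the actual S3 Gateway code: the service validates a user's existing HEP credential, then performs the storage operation with its own cloud credentials, so users never hold cloud keys.)

      ```python
      import boto3

      s3 = boto3.client("s3")  # the gateway's own cloud credentials

      def validate_hep_credential(credential):
          # Stub: a real gateway would verify an x509 proxy or token here
          # and map it to a local username; this placeholder trusts the input.
          return credential["user"]

      def gateway_upload(credential, bucket, key, local_path):
          user = validate_hep_credential(credential)
          # Per-user authorization policy would also be enforced here.
          s3.upload_file(local_path, bucket, f"{user}/{key}")
      ```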

      [Enrico Fermi Institute] 15:07:16
      Interesting. So I had a quick question about that, Andy: does this only work in front of, you know, a cloud provider?

      [Enrico Fermi Institute] 15:07:24
      Or would that be something that would work in front of MinIO, for instance? Okay, awesome.

      [abh] 15:07:28
      It definitely works in front of MinIO.

      [Enrico Fermi Institute] 15:07:34
      So does this help at all with the cloud certificate authority type of problems?

      [Enrico Fermi Institute] 15:07:38
      Or is this

      [abh] 15:07:40
      Not exactly. I mean, it solves the problem perfectly for storage,

      [abh] 15:07:46
      but it doesn't quite solve the problem for compute access; you need something else for that.

      [abh] 15:07:54
      And I would assume PanDA could handle that, but then the PanDA team would have to say something on that.

      [Enrico Fermi Institute] 15:07:54
      Yeah.

      [Enrico Fermi Institute] 15:08:03
      But the compute is a bit easier, though, right? Because, you know, presumably the experiments control the environment that they run in, so they can add whatever CA authorities they want.

      [Enrico Fermi Institute] 15:08:14
      You know For instance, the if you, if you run the the Osg, you know regular grid.

      [Enrico Fermi Institute] 15:08:17
      C a cert bundle, or whatever that has. For instance, the in common.

      [Enrico Fermi Institute] 15:08:23
      I'm sorry, not in common, but the let's encrypt.

      [Enrico Fermi Institute] 15:08:26
      Ca: trusted right? So that that's a little bit easier.

      [abh] 15:08:26
Right, yeah, but I mean, again, it really depends on how that's integrated.

      [Enrico Fermi Institute] 15:08:28
      There.

      [abh] 15:08:34
So I don't know the specifics on the compute side, and how that's done,

[abh] 15:08:38
so I can't really address that. But presumably, you know, PanDA and equivalent systems can provide the same kind of access.

      [Fernando Harald Barreiro Megino] 15:08:49
Okay. For the integration with the cloud, what we do is we talk directly with the Google, let's say, APIs. And we have

[Fernando Harald Barreiro Megino] 15:09:02
the service account JSON on the harvester machine, and that's it.

[Fernando Harald Barreiro Megino] 15:09:10
Anything else?
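For readers unfamiliar with that pattern, a minimal sketch of calling a Google API with a service-account JSON key follows; the key path, project, and zone are placeholders, not the actual harvester configuration:

```python
from google.oauth2 import service_account   # google-auth library
from googleapiclient import discovery       # google-api-python-client

# Load the service-account key that lives on the harvester node (path assumed).
creds = service_account.Credentials.from_service_account_file(
    "/etc/harvester/sa-key.json",
    scopes=["https://www.googleapis.com/auth/cloud-platform"],
)

# With those credentials, any Google Cloud API can be called; e.g. list VMs.
compute = discovery.build("compute", "v1", credentials=creds)
instances = compute.instances().list(
    project="my-project", zone="us-central1-a"   # placeholder names
).execute()
```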

      [Enrico Fermi Institute] 15:09:16
      Okay.

      [Enrico Fermi Institute] 15:09:20
Philip, did you guys have other comments, or is that all?

      [Fernando Harald Barreiro Megino] 15:09:25
No, I'm fine.

      [Enrico Fermi Institute] 15:09:28
Okay? Yeah. Done. Okay.

      [Douglas Benjamin] 15:09:31
Yeah, one thing I was wondering about is:

[Douglas Benjamin] 15:09:37
right now, at least for ATLAS, the HPCs and the cloud instances are what I'll call standalone, in the sense that we are not flowing work...

[Douglas Benjamin] 15:09:54
We're not using either of them as an extension of our existing resources, sort of a blending of on-prem and off-prem in the same, let's say, batch system, for example. Is that something we actually should be investigating for the future?

      [Enrico Fermi Institute] 15:10:17
Are you suggesting, like, taking a Tier

[Enrico Fermi Institute] 15:10:19
2 and adding cloud resources to it? Okay.

      [Douglas Benjamin] 15:10:23
Something like that. I mean, I know Dirk mentioned that they were sort of looking at that. And maybe it's not a Tier

[Douglas Benjamin] 15:10:29
2; maybe it's in the analysis space.

[Douglas Benjamin] 15:10:33
Although what was shown with Rubin... they're not doing on-prem/off-prem,

[Enrico Fermi Institute] 15:10:37
See.

[Douglas Benjamin] 15:10:42
in that they separated the science part to the cloud. Okay.

      [Enrico Fermi Institute] 15:10:45
I mean, personally, I think that in the analysis space the cloud makes a lot of sense, right?

[Enrico Fermi Institute] 15:10:50
Because you can get a lot of resources, any kind of resource you want.

[Enrico Fermi Institute] 15:10:54
You can get it very quickly, if you pay for it, obviously, but...

      [Enrico Fermi Institute] 15:11:05
      I don't know if anybody else had other comments on that

      [Enrico Fermi Institute] 15:11:13
Okay, other comments, general questions, other things we missed?

      [Enrico Fermi Institute] 15:11:23
One meta question is: how well did this format work for people?

      [Douglas Benjamin] 15:11:28
Well, I have a question. There was a policy topic; did that get... will that be talked about?

      [Enrico Fermi Institute] 15:11:30
      Yes.

      [Douglas Benjamin] 15:11:36
Or is it that you have the one line in the outline

[Douglas Benjamin] 15:11:40
on the Indico that says policy features?

      [Enrico Fermi Institute] 15:11:43
Yeah, I mean, I think that was actually covered in the morning, right?

[Enrico Fermi Institute] 15:11:49
And that was just: what can we...

[Enrico Fermi Institute] 15:11:53
you know, what are the policies that we could ask for that would make our lives easier?

      [Douglas Benjamin] 15:12:01
No, because some of the policies for the HPCs can be very disruptive. And I know we didn't talk about the incentives for the LCFs.

[Douglas Benjamin] 15:12:11
Right? If we can work with DOE, and ultimately that's for their biggest machines, to really

[Douglas Benjamin] 15:12:19
rethink the policies, the incentives that their minders

[Douglas Benjamin] 15:12:30
hand them, that might make these resources better for us.

      [Enrico Fermi Institute] 15:12:36
      Yeah, yes, right?

      [Douglas Benjamin] 15:12:37
Assuming we can run on them, blah blah blah. By that I mean, as a very specific example: Argonne ALCF schedules the biggest jobs faster than the smaller jobs. And yet, you know, we were able, in ATLAS, to run across the whole machine, but we had to

      [Enrico Fermi Institute] 15:12:59
      Okay.

      [Douglas Benjamin] 15:13:01
chop things up a little bit, which is a little different, you know, due to the different incentives.

[Douglas Benjamin] 15:13:10
And that's where we need help going up through DOE, to talk from one office to the other.

      [Enrico Fermi Institute] 15:13:17
      Right.

      [Enrico Fermi Institute] 15:13:21
Yeah, I mean, so how does that, you know, get translated into something that we put in the report? Right? Like, do we just say...

      [Douglas Benjamin] 15:13:30
So you do it as a finding, and then maybe a recommendation,

[Douglas Benjamin] 15:13:35
if it's strong enough.

      [Enrico Fermi Institute] 15:13:36
      Yeah.

      [Douglas Benjamin] 15:13:38
      That's how it gets translated into the report.

      [Dale Carder] 15:13:38
Ha! Having served on the review committees for, like, some of the LCFs, it could be very interesting for you guys to read the annual reports

[Dale Carder] 15:13:49
they write. For example, I'm sure ALCF has at least their 2021 report published online. See the terminology they use, and how they describe,

[Dale Carder] 15:14:00
you know, essentially the incentive structure they're given, and how they match jobs to the charge

[Dale Carder] 15:14:06
they're given. Phrasing things in those terms would certainly help.

      [Enrico Fermi Institute] 15:14:17
Yeah. This seems related to the comment that was made, I think, on Monday, about,

[Enrico Fermi Institute] 15:14:22
you know, we have to kind of sell it

[Enrico Fermi Institute] 15:14:27
in their terms.

      [Dale Carder] 15:14:28
Yeah. And aside from just the difference between HTC

[Dale Carder] 15:14:34
and HPC, there are still sort of differences in language, you know, about how these workflows exist, what resources they need, and why it should be a national imperative to make sure that

[Dale Carder] 15:14:49
the resources are allocated for your use case.

      [Enrico Fermi Institute] 15:15:02
Other comments there?

      [Enrico Fermi Institute] 15:15:08
Other comments?

      [Douglas Benjamin] 15:15:08
Yeah, I think we have to phrase the language right. And the reason I say that is because I had a conversation during some activity I was doing, and that's where it came up.

      [Enrico Fermi Institute] 15:15:20
      Okay.

      [Douglas Benjamin] 15:15:20
Good. If we can give the LCFs some cover, and work with them to change the policies...

[Douglas Benjamin] 15:15:29
They really do want to have their machines be usable by a large cross-section of the sciences.

[Douglas Benjamin] 15:15:35
But it's the way they're being judged;

[Douglas Benjamin] 15:15:43
until that changes, they can't do things that make them much more usable for us.

[Douglas Benjamin] 15:15:48
So that's where we have to help them help us.

      [Enrico Fermi Institute] 15:15:50
      Sure.

      [Douglas Benjamin] 15:15:51
A report like this might be able to do that, if it goes far enough up the food chain.

      [Enrico Fermi Institute] 15:16:03
Okay, thanks.

      [Enrico Fermi Institute] 15:16:07
Final thoughts?

      [Enrico Fermi Institute] 15:16:17
Okay. Maybe I'll move on to the last slide here.

      [Enrico Fermi Institute] 15:16:20
So yeah, what happens next? I mean, first of all, you know, we really appreciate everybody kind of slogging it out and doing this for 3 days, as a mix of mostly remote and some folks coming in person.

[Enrico Fermi Institute] 15:16:32
We really appreciate everybody's input. You know, I think at this point our next steps would be, you know...

[Enrico Fermi Institute] 15:16:39
We'll follow up with some folks after the workshop; I'm sure we'll want to pick your brains a bit

[Enrico Fermi Institute] 15:16:48
more about specific areas of the report that we're writing.

[Enrico Fermi Institute] 15:16:52
Yeah, we would really appreciate it if you keep sending your feedback, keep sending us your thoughts on these topics, you know.

[Enrico Fermi Institute] 15:17:01
I put some kind of tentative dates here for a draft, you know.

[Enrico Fermi Institute] 15:17:06
We're going to plan to have a draft out by, you know, the first of November, and then our report will be finished by December

[Enrico Fermi Institute] 15:17:15
first. Yeah, so thank you. Do any of the other folks have comments?

      [Enrico Fermi Institute] 15:17:27
Dirk?

      [Dirk Hufnagel] 15:17:33
No, I think I just wanna maybe add: thanks to everyone that participated in the discussion; we are very appreciative of that. We got some good discussion going. Now we just have to go through the notes, summarize, get a report out, maybe get more input. If there's more input, follow up

      [Enrico Fermi Institute] 15:17:50
Yeah, yeah, if folks want to help write pieces, you know, I think that would be okay.

      [Dirk Hufnagel] 15:17:51
      With some people.

      [Dirk Hufnagel] 15:17:59
There are still some hands up, so maybe there are some last-minute comments.

      [Enrico Fermi Institute] 15:18:00
      Yeah, indeed. Eric.

      [Eric Lancon] 15:18:03
Yes, I wanted to congratulate you two for organizing this workshop.

[Eric Lancon] 15:18:09
When I saw the agenda, I thought it would be mission impossible, and yet you did it.

[Eric Lancon] 15:18:18
Oh, and it was very interesting, very fruitful discussions. Congratulations!

      [Eric Lancon] 15:18:24
      Thank you very much

      [Enrico Fermi Institute] 15:18:25
Thank you. Tony?

      [Tony Wong] 15:18:28
So you talked about "keep sending us your feedback". Where should that feedback be sent?

[Tony Wong] 15:18:33
Just to your own personal email? Is that how it works?

      [Enrico Fermi Institute] 15:18:36
Yeah, you know, I thought about that when I wrote that bullet point on the slide. I guess I should have put the email addresses on there, or a mailing list or something. Yeah, maybe we'll follow up

[Enrico Fermi Institute] 15:18:48
with another mail, you know, with how folks can keep contributing, and how we can keep the conversation going.

      [Enrico Fermi Institute] 15:19:01
Other comments?

      [Paolo Calafiura (he)] 15:19:05
      just want to echo what Eric said. Great job, guys

      [Enrico Fermi Institute] 15:19:10
Hey, thank you very much. Okay? Well, thanks, everybody. Yeah, we'll follow up.

      [Enrico Fermi Institute] 15:19:16
      Appreciate your time.

      [Lindsey Gray] 15:19:17
      Thank you.

      [Dirk Hufnagel] 15:19:19
Hey, thanks. Bye.

      [Enrico Fermi Institute] 15:19:22
Bye, everyone.

      • 13:00
        Invited contribution from Vera Rubin. Cloud experience and plans 20m
        Speaker: Kian-Tat Lim (SLAC National Accelerator Lab/Vera C. Rubin Observatory)

        [Eastern Time]

         

Let's see... as we've been doing previously, we'll try to wait till maybe 5 after before we get started, and I think then we'll jump right into the presentation.

        [Enrico Fermi Institute] 14:01:53
We'll start in just a couple of minutes here.

        [K-T Lim] 14:02:08
I only just figured out how to log in to CERN,

[K-T Lim] 14:02:22
so I can upload.

        [Enrico Fermi Institute] 14:02:35
        Okay.

        [Enrico Fermi Institute] 14:02:38
Yeah, just let me know, KT, when you're ready to start sharing.

        [Enrico Fermi Institute] 14:02:43
We'll probably wait for people to come into the room. I've posted the sort of relevant charge question here: what can US ATLAS

[Enrico Fermi Institute] 14:02:53
and US CMS learn from related international efforts, like...

        [K-T Lim] 14:02:59
Yup, so I'm ready to share any time, and I've uploaded a PDF

[K-T Lim] 14:03:08
of my slides and presenter notes to the Indico page.

        [Enrico Fermi Institute] 14:03:11
Okay, awesome. Thank you. Yeah, in the morning session we had a few more people,

[Enrico Fermi Institute] 14:03:16
so let's see if we can get some of those folks back. Maybe like 2 more minutes.

        [K-T Lim] 14:03:17
        Okay.

        [Enrico Fermi Institute] 14:03:23
And then, yeah, we can start.

        [K-T Lim] 14:03:23
        Sure.

        [K-T Lim] 14:03:29
        It's a little early for lunch here, but

        [Enrico Fermi Institute] 14:03:52
        okay.

        [Enrico Fermi Institute] 14:04:34
Okay, maybe let's go ahead and get started. I'm gonna stop sharing, KT, and then you can

[Enrico Fermi Institute] 14:04:39
start sharing your slides.

        [K-T Lim] 14:04:39
        okay, okay.

        [K-T Lim] 14:04:52
        Let me see

        [K-T Lim] 14:04:53
        Let me see! Oh, geez

        [Enrico Fermi Institute] 14:05:03
        Okay.

        [K-T Lim] 14:05:07
Hold on, need to grant access, because apparently I updated something.

[K-T Lim] 14:05:07
Hold on!

        [K-T Lim] 14:05:25
        okay, I will be back in a second

        [Enrico Fermi Institute] 14:05:27
        okay.

        [K-T Lim] 14:06:01
Hey, hopefully this works better. That looks a lot better.

        [K-T Lim] 14:06:08
        okay.

        [Enrico Fermi Institute] 14:06:08
        Okay, great.

        [K-T Lim] 14:06:16
Oh, is that good?

        [Enrico Fermi Institute] 14:06:17
        Yep.

        [K-T Lim] 14:06:21
Okay, well, let's get started then. Thank you very much for inviting me. Happy to share a little bit about the Rubin Observatory's experience with cloud computing:

[K-T Lim] 14:06:37
how we got to where we are, where we think we are, and where we're going.

        [K-T Lim] 14:06:44
Let me just start by saying that usually clouds are very bad for astronomers.

[K-T Lim] 14:06:49
You can see some clouds on the horizon over La Serena on the left, and then there are plenty of dust clouds in the Milky Way, and all of those block views of things that astronomers like to see. But in this case they're actually pretty good, so we

[K-T Lim] 14:07:02
like the way that the cloud is working out for us.

        [K-T Lim] 14:07:07
What is the Rubin Observatory doing? The Rubin Observatory is being built on top of a mountain in Chile in order to perform the Legacy Survey of Space and Time. The survey will scan the sky, taking 20 TB per night of 30-second images

[K-T Lim] 14:07:23
that'll cover the entire visible sky every few days.

        [K-T Lim] 14:07:25
This is essentially a movie of the whole sky, or at least of the part of the sky that we can see in the visible. We have several different data products that are produced on different cadences.

        [K-T Lim] 14:07:36
So first of all, we have prompt data products that generate primarily alerts;

[K-T Lim] 14:07:43
these are indications that something has changed in the sky from what it used to be. And so we need to process the images from the telescope within 60 s to issue those alerts, so that other

[K-T Lim] 14:07:57
telescopes can then follow them up and observe the things that have changed. Our Data Release Production executes approximately once a year, and it reprocesses all images that have been taken to date using a consistent set of algorithms and configurations. And so

        [K-T Lim] 14:08:16
that's obviously a data set that's growing each time, and the complexity of the analysis is likely to grow each time as well,

[K-T Lim] 14:08:24
so that needs to go faster and faster as we progress, because we want to issue one data release each year. And finally, and definitely not least, we have the Rubin Science Platform, which provides access to the data

[K-T Lim] 14:08:42
products and services for all science users and project staff, to do analysis and reprocessing of the data that has been taken. Not shown on this slide, but also important, is our internal staff:

[K-T Lim] 14:08:56
developers need to do both ad hoc and production-style processing as well. So that's another sink of compute and storage.

        [K-T Lim] 14:09:05
So the kind of architecture that we have to actually perform this is a data management system that looks like this.

[K-T Lim] 14:09:15
Here we have the telescope as kind of an input device off on the left-hand side.

        [Enrico Fermi Institute] 14:09:18
        Okay.

        [K-T Lim] 14:09:19
        My colleagues who are working on actually building the thing that's pictured behind me would argue that they're doing a lot of the work.

        [K-T Lim] 14:09:27
        But we think that most of it is in the data management system over here.

        [K-T Lim] 14:09:30
So we grab data at the summit, on the left-hand side, and bring it to the US Data Facility.

[K-T Lim] 14:09:35
We have the prompt processing chain that's running in near real time and issuing alerts

[K-T Lim] 14:09:41
to the alerts community. In the middle and on the right-hand side of this diagram we have offline processing that is executing in sort of batch mode.

        [K-T Lim] 14:09:51
        It's high throughput computing, not high performance computing.

        [K-T Lim] 14:09:54
And it's running across multiple sites. We have partners in France, at CC-IN2P3, and in the UK, who will be executing large portions of the Data Release Production.

        [K-T Lim] 14:10:06
And then, finally, at the bottom and in the upper right,

[K-T Lim] 14:10:10
we have dedicated resources for the science user access and analysis on the Rubin Science Platform.

        [K-T Lim] 14:10:18
        I'll talk about that more later

        [K-T Lim] 14:10:23
We did a number of proof-of-concept engagements to try to determine how the cloud could work for us, and with this architecture.

[K-T Lim] 14:10:32
So we did 3 different engagements with 2 separate cloud vendors, and they're documented in a bunch of Data Management technical notes, which are all linked from this page. The first one in each series is the goals of the engagement,

[K-T Lim] 14:10:49
what we set out to do, and then we have a report of what we actually managed to accomplish.

        [K-T Lim] 14:10:54
So the first engagement we mostly leveraged to get sort of cloud-native experience: how to deploy services

[K-T Lim] 14:11:05
and systems with modern technologies, to improve our deployment models, to get things containerized, etc.,

[K-T Lim] 14:11:14
and not just have them running as shell scripts or things that an individual developer ran.

        [K-T Lim] 14:11:20
We learned about potential bottlenecks with high bandwidth-delay-product networks.

        [K-T Lim] 14:11:25
Obviously, we're transmitting data from Chile to the US;

[K-T Lim] 14:11:29
that's over 200 ms of round-trip time, and it's a 100-gigabit network.

        [K-T Lim] 14:11:35
So very high bandwidth. We need to get the data across, and we need to make that work efficiently.

[K-T Lim] 14:11:41
And so there were a number of bottlenecks

[K-T Lim] 14:11:42
there that we worked through.
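As an aside for the reader: the bandwidth-delay product is what makes such a link hard to fill. A back-of-the-envelope sketch with the figures mentioned above (assumed round numbers, not measured values):

```python
# Bandwidth-delay product for a Chile-to-US style link: the amount of data
# that must be "in flight" to keep the pipe full. TCP socket buffers smaller
# than this are one classic source of the bottlenecks being described.

bandwidth_bps = 100e9     # 100 gigabit/s link (assumed)
rtt_s = 0.200             # ~200 ms round-trip time (assumed)

bdp_bytes = bandwidth_bps / 8 * rtt_s
print(f"BDP: {bdp_bytes / 1e9:.1f} GB must be in flight to fill the link")
# -> 2.5 GB; a single TCP stream with default megabyte-scale buffers reaches
#    only a small fraction of line rate, hence tuned buffers or many
#    parallel streams for long-haul transfers.
```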

        [K-T Lim] 14:11:53
And we learned about how to interact with the vendors, and what mechanisms and ways of working with them worked well for us and for them. The second engagement was with a different vendor.

        [K-T Lim] 14:12:00
We tested workflow execution middleware, this is some of our custom

[K-T Lim] 14:12:05
middleware, at a modest scale, up to about 1,200 virtual CPUs, and we were able to make use of spot, or preemptible, instances to run a lot of our processing. It's easy to retry a particular quantum of processing if it fails for some reason, if

[K-T Lim] 14:12:23
the processor went away, and that reduced costs by a considerable amount when you allow preemption that way.
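A minimal sketch of why preemptible instances pair well with idempotent work quanta; this is not Rubin's middleware, and PreemptedError is a stand-in for however a given cloud signals VM reclamation:

```python
import queue

MAX_RETRIES = 5

class PreemptedError(Exception):
    """Stand-in for 'the spot VM running this work was reclaimed'."""

def run_quantum(quantum):
    ...  # one idempotent unit of processing (placeholder)

def drain(all_quanta):
    # Each quantum is idempotent, so a preempted attempt is simply re-queued
    # and re-run on a fresh (cheap) spot instance.
    work = queue.Queue()
    for q in all_quanta:
        work.put((q, 0))
    while not work.empty():
        quantum, attempts = work.get()
        try:
            run_quantum(quantum)
        except PreemptedError:
            if attempts + 1 < MAX_RETRIES:
                work.put((quantum, attempts + 1))
            else:
                raise  # persistent failure, not just preemption
```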

        [K-T Lim] 14:12:29
And in the third engagement we tested improved workflow execution middleware,

        [Enrico Fermi Institute] 14:12:36
        Yeah.

        [K-T Lim] 14:12:37
actually at a similar scale, up to 1,600 vCPUs.

        [K-T Lim] 14:12:45
And here we also did some transfers over the long-haul network again, and learned about the desirability of having HTTP/2 persistent connections for uploading to object stores in particular.
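To illustrate the persistent-connection point, a generic sketch (not Rubin's transfer code; the endpoint is a placeholder, and the 'requests' library uses HTTP/1.1 keep-alive rather than HTTP/2, but the latency argument is the same):

```python
import requests

# Over a ~200 ms RTT path, each new connection costs multiple round trips of
# TCP/TLS handshake. A persistent session pays that once and reuses the same
# connection for every object uploaded.
session = requests.Session()
for key, data in objects:  # 'objects' assumed: iterable of (name, bytes)
    session.put(f"https://bucket.example.com/{key}", data=data)
```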

        [K-T Lim] 14:13:07
So all of these taught us something about working with the cloud, and some of the things that we learned are things people don't necessarily talk about a lot. For one, we were able to work with a vendor who had relatively low bureaucracy, high flexibility, and a willingness to assist; a well-defined point of

[K-T Lim] 14:13:26
contact and rapid internal processes made things work much more smoothly

[K-T Lim] 14:13:33
as we went through these engagements and through subsequent work with these vendors. Deep engagement with the vendors' engineering teams, being able to talk to the actual product managers, and even in some cases the engineers who are working on these products, also was useful. And something that turned

[K-T Lim] 14:13:53
out to be quite unexpected is that consultants can also be very useful.

        [K-T Lim] 14:13:59
So there are a number of consultants who are obviously fully trained and certified for building things on these vendors'

[K-T Lim] 14:14:08
clouds. They don't know any more than the, you know,

[K-T Lim] 14:14:14
people at the vendors, necessarily, but unlike the vendor engineers, they are allowed to work on your code. The vendors can't work on your code; that would cause

[K-T Lim] 14:14:26
all kinds of problems, even though in our case all of our code is open source.

[K-T Lim] 14:14:31
But the consultants can. They can actually modify things and update your own code to work better in the cloud.

[K-T Lim] 14:14:40
And so that was something that turned out to be very interesting.

        [K-T Lim] 14:14:47
We did a lot of cost modeling. We already had very complex internal spreadsheets to understand what our data sizing and compute sizing requirements would be.

[K-T Lim] 14:15:02
We adjusted them somewhat to fit the cloud storage models and how things would work there, and our vendors also produced spreadsheets that then matched

[K-T Lim] 14:15:17
those needs to the available technologies and their quoted prices for them.

        [K-T Lim] 14:15:24
So in our case, our compute costs, compared with high-energy

[K-T Lim] 14:15:29
physics, are not that large; we're talking only something in the millions of core-hours, and that's only in year 10 of the survey,

[K-T Lim] 14:15:37
when we're doing the maximal processing of the entire survey contents.

[K-T Lim] 14:15:42
So that's quite reasonable. The storage costs, on the other hand, for frequently accessed data, turned out to be a major problem.

        [K-T Lim] 14:15:50
We are expecting to have hundreds of petabytes of results, both results that are sort of in process, that are being developed

[K-T Lim] 14:16:01
for the next data release, as well as results that are part of the previous data releases that are already public.

[K-T Lim] 14:16:08
So those storage costs can be very large. And we have had a number of, you know,

        [Enrico Fermi Institute] 14:16:13
        Okay.

        [K-T Lim] 14:16:19
kind of debates about why the on-prem storage costs seem to be less than the in-cloud storage costs.

[K-T Lim] 14:16:26
I mean, in some cases it's because the total cost of ownership is somewhat different:

[K-T Lim] 14:16:31
sometimes things like people, like administrators, can be charged to different accounts, and they don't actually fall under the project's

[K-T Lim] 14:16:37
budget. But I think a lot of it is also that in the cloud you're paying for more durability and performance

[K-T Lim] 14:16:44
than we often need in science, right? We often have replicas of the data in other places, so we don't need, you know, 8 nines or something like that worth of durability in one place. And also we can often schedule when we're going to access

[K-T Lim] 14:17:02
data, and so we don't need the kind of performance that you might need for commercial workloads.

        [K-T Lim] 14:17:09
Egress costs are often a problem, but there are mitigations. And of course, if you can keep most data transfers either inbound to the cloud or entirely within

[K-T Lim] 14:17:21
the cloud, then there are no egress costs, and so that helps a lot.

[K-T Lim] 14:17:26
If we manage to do most of the data summarization and visualization within the cloud, and then only have the results

[K-T Lim] 14:17:33
exit, that also limits the egress quite a bit.

[K-T Lim] 14:17:38
The vendors tend to give credits for egress based on the total amount of spending that you're doing on all the other services that you're buying from them,

[K-T Lim] 14:17:47
and so those credits can also help minimize the egress costs.

[K-T Lim] 14:17:51
And finally, we did look at, but have not yet moved on, getting a dedicated interconnect.

[K-T Lim] 14:17:58
With a dedicated interconnect, you're not using the public Internet or the public egresses, and as a result there can be substantial discounts on the egress costs, because it's paid for in a lump sum rather than on a per-byte or per-gigabyte basis.
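A toy comparison of those two egress pricing models; every rate below is a placeholder, not any vendor's actual price:

```python
egress_tb = 500                    # TB leaving the cloud per month (assumed)

per_gb_rate = 0.08                 # $/GB over the public internet (assumed)
public_cost = egress_tb * 1000 * per_gb_rate

port_commit = 15_000               # $/month dedicated-interconnect lump sum (assumed)
discounted_rate = 0.02             # $/GB over the dedicated link (assumed)
dedicated_cost = port_commit + egress_tb * 1000 * discounted_rate

print(f"public: ${public_cost:,.0f}/mo, dedicated: ${dedicated_cost:,.0f}/mo")
# The lump-sum model wins once monthly volume is large enough to amortize
# the fixed port cost: here $40,000 vs $25,000.
```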

        [K-T Lim] 14:18:14
So the final decision that we made was to have a hybrid model.

[K-T Lim] 14:18:21
So we have most of the storage and the large-scale compute,

[K-T Lim] 14:18:25
and I'll explain why in a second, on-prem at the US

[K-T Lim] 14:18:29
Data Facility, which is located at SLAC National Accelerator Laboratory.

[K-T Lim] 14:18:33
The users, however, will be supported in the cloud, on a Cloud

[K-T Lim] 14:18:37
Data Facility that is actually vendor-agnostic, but we're anticipating that it will be on the Google Cloud Platform, for various reasons.

        [K-T Lim] 14:18:47
And so it looks something like this: we have again the telescope sending data to the US

[K-T Lim] 14:18:52
Data Facility, and the data release processing and prompt processing both occur there, with the main archive storage.

[K-T Lim] 14:19:00
But in the Cloud Data Facility we have the Rubin Science Platform services.

[K-T Lim] 14:19:05
We have a cache of data, both for relatively small data sets that we can copy in their entirety and for partial storage of other data sets that are being used frequently. And per-user storage would also be

[K-T Lim] 14:19:28
stored entirely in the cloud. This shows that user batch would be executed at the US Data Facility so that it could run against the archive storage,

[K-T Lim] 14:19:37
and I'll talk about that, and where those dividing lines might be, in a bit.

        [K-T Lim] 14:19:42
So, the Rubin Science Platform: what is it, really?

[K-T Lim] 14:19:48
Again, it's for our science users, who are coming to use dedicated resources that are provided by the project to access our large data sets and use web-based applications on them.

[K-T Lim] 14:20:01
So there's a portal, which provides access and visualization and sort of structured expeditions through the data set, with query generation tools as well as lots of visualization, including joint visualization of images and

[K-T Lim] 14:20:23
catalogs. We then have Jupyter notebooks;

[K-T Lim] 14:20:29
that's actually quite common now, it was not that common

[K-T Lim] 14:20:31
a few years ago, when we were starting out on this vision.

[K-T Lim] 14:20:37
But that's for more ad hoc analysis by users. And then we have web APIs.

[K-T Lim] 14:20:43
These are web services with interfaces defined by the International Virtual Observatory Alliance for astronomy, and they provide access to images, both raw and processed, as well as catalogs of things seen on those images. And so that

[K-T Lim] 14:21:04
provides, excuse me, both remote access and a little bit of processing, so that we can do things like

[K-T Lim] 14:21:11
cutting out sections of images, or pasting together images.

[K-T Lim] 14:21:15
So this is the user experience the users will have. And behind those 3 major aspects there are the data releases, an alert filtering service, user databases, user files, and all kinds of other infrastructure that's necessary to make that work.
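For a sense of what those IVOA-style APIs look like from the user side, here is a hedged sketch using the generic pyvo client; the service URL and coordinates are placeholders, not Rubin's actual endpoint:

```python
import pyvo  # generic IVOA client library

# Query a Simple Image Access (SIA) service for images around a sky position.
sia = pyvo.dal.SIAService("https://data.example.org/api/sia")  # placeholder URL
results = sia.search(pos=(55.75, -32.27), size=0.05)  # RA/Dec and size in deg

for record in results:
    print(record.getdataurl())  # URL from which each matching image (or
                                # cutout) can be fetched
```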

        [K-T Lim] 14:21:33
So, our uses of cloud services. Obviously the primary one is going to be the Rubin Science Platform.

[K-T Lim] 14:21:41
The reasons for putting this in the cloud include these.

[K-T Lim] 14:21:46
So there's security: by putting this in the cloud,

[K-T Lim] 14:21:48
we can use separately managed identities that have nothing to do with the identities at our on-prem facilities at SLAC,

[K-T Lim] 14:21:59
so all of our users do not need to get SLAC accounts.

[K-T Lim] 14:22:03
This is very important, because the Department of Energy has a lot of restrictions, and it's not necessarily very rapid at generating accounts at labs.

[K-T Lim] 14:22:14
So being able to maintain our own accounts makes things much more efficient, and allows us to integrate with things like federations that we couldn't otherwise necessarily do.

[K-T Lim] 14:22:26
It also means we have a good story for cyber security at the lab, because we have relatively limited interfaces with the on-prem facilities.

[K-T Lim] 14:22:34
There are certain services that can be queried from the cloud, and those can be listed and tracked and understood.

        [K-T Lim] 14:22:46
A huge benefit is elasticity, right? So especially after we have an annual data release, we're expecting that hordes of astronomers

[K-T Lim] 14:22:55
will descend on us and want to look at what's new in that release.

[K-T Lim] 14:22:59
This might also happen, for example, around key conferences, when people are trying to do work. So in the cloud we have essentially infinite elasticity.

[K-T Lim] 14:23:07
We can scale up the Rubin Science Platform by deploying more notebook servers, more API servers, and even more portal servers, arbitrarily, and so we expect to be able to handle those loads relatively easily. The back-end

[K-T Lim] 14:23:26
services could be an issue, but we can also do those in a scalable manner, using object stores, scalable distributed file systems, and a scalable distributed database in the back end.
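A hedged sketch of what "deploying more notebook or API servers" amounts to operationally, using the official Kubernetes Python client; the deployment and namespace names are placeholders, not Rubin's actual manifests:

```python
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when run in-cluster

# Scale a user-facing deployment out ahead of an expected load spike
# (e.g. the post-data-release rush), then back down afterwards.
apps = client.AppsV1Api()
apps.patch_namespaced_deployment_scale(
    name="science-platform-api",        # placeholder deployment name
    namespace="rsp",                    # placeholder namespace
    body={"spec": {"replicas": 40}},
)
```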

        [K-T Lim] 14:23:47
A potential advantage we are looking to prove, but haven't quite yet,

[K-T Lim] 14:23:51
is that you could bring your own resources into the cloud.

[K-T Lim] 14:23:56
So if a science user had a grant or some other means of providing cloud resources on the same cloud vendor that we're using, they can federate those resources with the ones that are already present for the Rubin Science Platform and essentially expand their

[K-T Lim] 14:24:18
capabilities. Compared with trying to actually purchase hardware at SLAC, or sending computers, or something like that, this is much,

[K-T Lim] 14:24:29
much, much easier. And so it gives people the ability to use all the same facilities, software, and user interfaces that they're familiar with, at a larger scale, just by adding on to what's present. And finally, the cloud can also provide access to new technologies, things like

[K-T Lim] 14:24:52
GPUs, TPUs, or software technologies like sophisticated infrastructure services, that are harder to deploy at a lab on premises.

[K-T Lim] 14:25:03
And again, you don't need to buy them and keep them working;

[K-T Lim] 14:25:10
you can rent them when you need them, and then throw them away.

        [K-T Lim] 14:25:15
So for the large-scale compute, we have executed a fairly large production.

[K-T Lim] 14:25:25
This is our Data Preview 0.2, which is only for 5 years and only for a small portion of the sky, not the full Data Release

[K-T Lim] 14:25:35
Production. But we were able to run it on larger numbers of nodes, 4,000 virtual CPUs. Again,

[K-T Lim] 14:25:42
not that much compared to high-energy physics, necessarily, but pretty large for what we're doing. But we're not expecting to execute the main survey Data Release Production

[K-T Lim] 14:25:51
this way; the cost of storing or egressing the large data products is too excessive to do that.

[K-T Lim] 14:25:58
We might be able to do user batch in the cloud,

[K-T Lim] 14:26:00
but it'll have some of the same drawbacks, in that we're expecting user batch jobs to also want to process large fractions of the available data.

[K-T Lim] 14:26:11
And so transmitting all of those into the cloud, or even storing them temporarily, can have some difficulties.

[K-T Lim] 14:26:20
But if we were able to do it, if we can get the caching and the sort of automated transfers working well, then there would be the advantages of having, again, the security and technology kinds of advantages that we would not have on premises. Right now

[K-T Lim] 14:26:43
we're going to require that the users who want to execute those large-scale

[K-T Lim] 14:26:49
batch jobs get SLAC accounts, and that may eventually become a problem.

        [K-T Lim] 14:26:55
We've found the cloud to be extremely useful for development and testing.

[K-T Lim] 14:26:58
Again, the elasticity, being able to scale up at will, and the technology advantages of being able to use new machines, large amounts of flash storage, for example, things like that that are not easily purchased, especially now with supply chain issues, in an on-premises model,

[K-T Lim] 14:27:21
have been very helpful for development. And we've also been able to do things like rapid prototyping with advanced services, such as serverless kinds of deployments.

        [K-T Lim] 14:27:31
There is one possible future issue. I mentioned we have a distributed, scalable database that runs on premises, that will handle and serve a lot of the catalogs being generated for the stars and galaxies that we're detecting on these images. That

[K-T Lim] 14:27:48
database has been customized for astronomy, and has a lot of advantages.

[K-T Lim] 14:27:53
One is that it does spherical geometry, which is kind of difficult in a lot of databases.

[K-T Lim] 14:28:02
It does what are called shared scans, where multiple queries that are touching the same tables

[K-T Lim] 14:28:09
do share I/Os, essentially, and that makes things much more efficient and can provide well-understood maximum query times. While the minimum query times may increase, the maximum query times for certain types of queries can be limited, and so

[Enrico Fermi Institute] 14:28:31
Okay.

[K-T Lim] 14:28:32
we can guarantee that your query will finish in a certain amount of time.
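A toy illustration of the shared-scan idea (the concept only; this is not the actual database code): several queries ride along on a single pass over the table, so one I/O cost serves all of them:

```python
def shared_scan(rows, predicates):
    """One sequential pass over 'rows'; each named predicate collects its
    own matches, so N concurrent queries cost one table scan, not N."""
    results = {name: [] for name in predicates}
    for row in rows:
        for name, pred in predicates.items():
            if pred(row):
                results[name].append(row)
    return results

# Two user queries sharing one scan of a small (ra, dec, mag) catalog:
catalog = [(55.1, -32.0, 21.3), (120.4, 10.2, 18.9), (55.2, -32.1, 24.0)]
out = shared_scan(catalog, {
    "bright": lambda r: r[2] < 20.0,          # magnitude cut
    "patch":  lambda r: 55.0 <= r[0] <= 56.0, # RA range cut
})
```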

        [K-T Lim] 14:28:38
We also have special indexes, especially spatial ones, that allow us to do astronomical types of queries much more efficiently.

[K-T Lim] 14:28:49
So, a lot of these differentiators are kind of going away with cloud deployments. Spherical geometry is becoming more available through GIS kinds of packages.

[K-T Lim] 14:28:59
The shared scan is still a win, but when everything is on NVMe flash, the number of IOPS is so high that you can do individual I/Os for each query without actually losing a lot. The special indexes that we have are still

[K-T Lim] 14:29:18
a bit of an issue; they are still better in-house than they are in the cloud, and retrofitting them to cloud

[K-T Lim] 14:29:25
databases is difficult. And finally, storage costs can still be an issue, again because we can do this cheaper in-house rather than using the cloud storage. And then, finally, using the cloud for archival or tape-replacement storage

[K-T Lim] 14:29:47
may be comparable in terms of total cost of ownership.

[K-T Lim] 14:29:51
This is something we're still investigating, especially if you don't retrieve the data.

[K-T Lim] 14:29:56
If you do have to retrieve the data, then there are large egress costs again, to get it out of the cloud and onto on-premises storage, and so that becomes an issue.

[K-T Lim] 14:30:07
But if you're in that kind of a disaster situation, it may not be that bad.

        [K-T Lim] 14:30:15
One other aspect of the cloud that has been kind of important, I guess, is reliability.

[K-T Lim] 14:30:22
So while I mentioned that in some cases the durability of storage might be overkill, in other cases,

[K-T Lim] 14:30:29
well, we do actually experience higher reliability, and a higher ability to deliver to our end users, by deploying on the cloud than on premises.

        [Enrico Fermi Institute] 14:30:37
        Okay.

        [K-T Lim] 14:30:38
First of all, one of the maybe-negatives is that we've seen that Kubernetes upgrades will roll through our clusters semi-arbitrarily. There are some controls that you can put on them,

[K-T Lim] 14:30:50
but the vendors kind of want to update it when they want to update it.

[K-T Lim] 14:30:55
So we need to make sure we've designed services to deal with these kinds of rolling outages.

[K-T Lim] 14:31:01
Not all of them are yet, but we will adjust them over time. Again,

[K-T Lim] 14:31:06
the durability of storage is extremely high, maybe more than necessary; service outages are quite rare and usually short, compared with some of the outages that we've had on-prem; and 24/7 support for basic infrastructure, and even for higher-level services, is often better than we have

[K-T Lim] 14:31:24
on-prem, where it may just be 8-by-5, essentially. So sometimes the reliability in the cloud is more than you need, and so you're paying for more than you actually need;

        [Enrico Fermi Institute] 14:31:33
        And

        [K-T Lim] 14:31:41
in other cases, it can actually be a benefit. So, trying to wrap up here:

[K-T Lim] 14:31:47
conclusion, status, and plans. The hybrid model seems to be suitable for our use

[K-T Lim] 14:31:52
cases. We are practicing today with an Interim Data Facility on the Google Cloud Platform, which hosts simulated data until the telescope is built.

[K-T Lim] 14:32:04
We're working with that to give scientists a chance to work with data that looks like the real thing,

[K-T Lim] 14:32:09
using all the tools that they will eventually have.

[K-T Lim] 14:32:13
We're building out our back-end on-prem infrastructure to practice the integration with the cloud and tune the various caching parameters

[K-T Lim] 14:32:23
about what gets sent to the cloud when, and we are obviously continuing to track developments in cloud services and pricing.

[K-T Lim] 14:32:32
And I'm happy to answer any questions.

        [Enrico Fermi Institute] 14:32:41
Okay. So we have a couple of hands raised. Lindsey,

[Enrico Fermi Institute] 14:32:45
why don't you go first?

        [Lindsey Gray] 14:32:48
Yeah, sure. Actually, just a quick operational one, about the fact that they're rolling through Kubernetes upgrades kind of at

[Lindsey Gray] 14:32:57
their own whim. In particular, since Kubernetes is updating the spec and the interface that you're talking to, how much maintenance burden have you found that to be, as the spec makes backwards-incompatible changes?

        [K-T Lim] 14:33:16
The upgrades of Kubernetes itself have typically not been too much of a problem.

        [Enrico Fermi Institute] 14:33:17
        See.

        [K-T Lim] 14:33:26
We run our development clusters on sort of the latest, more bleeding-edge versions, and our production clusters on the more stable versions.

[K-T Lim] 14:33:41
So we've typically seen any problems already, either in the development clusters, or even at the summit, where we're probably a little bit more rapid to update than even the stable channels in the cloud.

[K-T Lim] 14:33:55
So we've been really prepared for any of these things that happen.

        [Lindsey Gray] 14:33:58
Okay. So there's nothing really nasty about the cadence of updates or upgrades on the cloud side of things,

[Lindsey Gray] 14:34:05
and you feel like you have control over the situation, by and large. Alright, cool.

[Lindsey Gray] 14:34:10
Thank you.

        [K-T Lim] 14:34:10
Adequately, yes. You do have to have people that are dedicated to keeping these things up to date and porting them. I know that there are some people in science who like the idea of, well, we're going to install our service on the machine and then sort of wrap the

[K-T Lim] 14:34:26
whole thing in amber and just kind of leave it there and have it run.

[K-T Lim] 14:34:30
And the model cannot be that way. You have to deal with OS updates and service updates, and be on top of them.

        [Lindsey Gray] 14:34:33
        Right.

        [Lindsey Gray] 14:34:40
Okay, cool. Thank you.

        [Enrico Fermi Institute] 14:34:44
        Okay, Tony.

        [Tony Wong] 14:34:46
Yeah, hi! So I've got basically, you know, a couple of questions rolled up into one, with respect to storage, because you kept mentioning concerns about the cost of egress and storage. So, you know, when I looked at Google and Amazon, I noticed that they

[Tony Wong] 14:35:05
have many, many different levels of storage: reliability, you know, responsiveness, backups, 24/7 availability, etc., and so forth.

[Enrico Fermi Institute] 14:35:16
Okay.

[Tony Wong] 14:35:16
Did the VRO do a study to determine at which point, in terms of the levels of storage services that it needs, the cloud would look more favorable cost-wise compared to on-premise?

[Enrico Fermi Institute] 14:35:22
Yeah.

[Tony Wong] 14:35:40
And then, if you look at the level of service, the opposite side of the question is also: did you do a study to optimize costs, given that most of the storage is going to be on-prem but some of the storage is going to be on the cloud?

[Tony Wong] 14:35:56
So at what point is there a tipping point where it pays to have cloud

[Tony Wong] 14:36:03
storage? You know, is it 10% of your data,

[Tony Wong] 14:36:08
20% of your data? At what point does it really pay to go to the cloud?

        [K-T Lim] 14:36:14
Okay. So, yeah, a couple of things. First of all, just something that our communications people make me say:

[K-T Lim] 14:36:23
we try not to use abbreviations for the name of the observatory;

[K-T Lim] 14:36:26
we prefer to just have it called the Vera Rubin Observatory, rather than VRO. Second of all: yeah, we did extensive modeling of how frequently our data is going to be accessed, because there are obviously different levels

[K-T Lim] 14:36:49
of access, and different prices for those, ranging all the way from sort of standard object store (and actually, even above that, there are POSIX file systems, which tend to be quite expensive): object store for frequently accessed data, where you don't get charged

[K-T Lim] 14:37:10
so much per operation (you do get charged a tiny bit per operation),

[K-T Lim] 14:37:16
but there's a larger rental cost for bytes per month; all the way down to the archival storage,

[K-T Lim] 14:37:25
cold storage, where you get charged a lot for retrieval but considerably less for the actual storage per month. And so we looked at, for each of our data sets, how many accesses we would be expecting to have, and as a result,

[K-T Lim] 14:37:44
what category of storage we could use for them.
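That mapping from expected access frequency to storage class can be made concrete with a toy rule like the one below; every rate is a placeholder, not a quoted vendor price:

```python
# (storage $/GB/month, retrieval $/GB) per class -- assumed, illustrative only
TIERS = {
    "standard": (0.020, 0.00),
    "nearline": (0.010, 0.01),
    "archive":  (0.0012, 0.05),
}

def monthly_cost(stored_gb, read_gb_per_month, tier):
    store_rate, retrieve_rate = TIERS[tier]
    return stored_gb * store_rate + read_gb_per_month * retrieve_rate

def best_tier(stored_gb, read_gb_per_month):
    return min(TIERS, key=lambda t: monthly_cost(stored_gb, read_gb_per_month, t))

# 1 PB of raw data reread roughly once a year favors a cold tier; a hot
# catalog read many times over favors the standard object store.
print(best_tier(1_000_000, 1_000_000 / 12))   # -> archive
print(best_tier(100_000, 3_000_000))          # -> standard
```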

        [K-T Lim] 14:37:51
Unfortunately, for a lot of our data sets we don't really know. We know what kinds of access patterns we have for our own processing, for the data releases, or for the prompt processing, or even somewhat for our developers. But the science users might do anything, and they can look at any data,

[K-T Lim] 14:38:11
anytime, and it's hard to say which pieces of data will be accessed more than others.

[K-T Lim] 14:38:18
So a lot of that ends up kind of being attributed to the most expensive object store

[K-T Lim] 14:38:27
as a result. And again, as I mentioned, some of the total cost of ownership that the cloud vendors are charging may not be fully charged to the project when it's on premises, and as a result our

[K-T Lim] 14:38:45
on-premise costs are considerably lower for those kinds of things. I mean, we have those numbers, in terms of a comparison of on-prem and in the cloud; they will differ depending on your institution

[K-T Lim] 14:38:59
and the cost of hardware. I think we've been getting good deals from the vendors, when they can actually deliver.

        [K-T Lim] 14:39:09
So we know at what dollar-per-month cost

[K-T Lim] 14:39:19
it would make sense to switch, and we're not anywhere in the ballpark right now.

[K-T Lim] 14:39:24
But, as I said, for the archival storage, when you compare it with tapes and tape robots and tape drives and things like that, it may have crossed over in terms of being cheaper in the long run to store it in the cloud than it

[K-T Lim] 14:39:46
is to store it on-prem. But some of that depends on what you assume in terms of how often you're going to retrieve it.

[K-T Lim] 14:39:52
Originally we were going to write all the raw data to tape immediately, and then actually reread it every year to do the reprocessing, which would both guarantee that it was actually readable, as well as make

[K-T Lim] 14:40:14
the cost of storage of that raw data lower.

[K-T Lim] 14:40:17
But it turns out that, as time has gone on, it's not too bad to actually store the raw data spinning.

[K-T Lim] 14:40:26
There are other reasons to store it spinning, and so we will not be doing that.

[K-T Lim] 14:40:30
So the actual number of retrievals from tape is hoped to be near 0.

        [K-T Lim] 14:40:41
Does that answer enough of your question? I'm sorry.

        [Tony Wong] 14:40:42
Yes, it does. It was very informative. Thank you.

        [Enrico Fermi Institute] 14:40:47
Right. Dirk, you want to jump in?

        [Dirk Hufnagel] 14:40:50
Yeah, I had two related questions, and they're both on cloud cost. What I was curious about is how you budget for what you spend on cloud throughout the year:

[Dirk Hufnagel] 14:41:06
whether you tell yourself, for this year we want to spend...

[Dirk Hufnagel] 14:41:09
we have this budget for cloud; or whether you're a bit more flexible, where you allocate funds for on-premise or cloud throughout the year. And then, related, independently of how you set that target: how do you actually control your cost, especially in light of still keeping the ability to support these elastic

[Dirk Hufnagel] 14:41:26
use cases? Because at some point, if your available money goes to 0, you can't really be fully elastic anymore.

[Dirk Hufnagel] 14:41:32
So...

        [K-T Lim] 14:41:33
Yes. Yeah, the way that our budgeting works: we have separated out the cloud and the on-prem costs,

[K-T Lim] 14:41:44
so they are not one. I mean, they are stemming from an original budget,

[K-T Lim] 14:41:51
but that budget has been divided up relatively early. And actually, one of the ways of getting a discount from the cloud provider,

[K-T Lim] 14:42:03
in this case Google Cloud Platform, was to provide a commitment that we would spend a certain amount.

[K-T Lim] 14:42:08
So we've already kind of pre-budgeted

[K-T Lim] 14:42:12
those amounts for a number of years, in order to be able to get substantial discounts. We don't expect that to be a problem.

[K-T Lim] 14:42:24
One of the nice things is that it is just a fungible dollar amount, and we can spend that on any services.

[K-T Lim] 14:42:32
So if we decide we don't want what we originally wanted,

[K-T Lim] 14:42:36
we want to change it to something else, that's no problem;

[K-T Lim] 14:42:40
it's all the same dollars. In terms of the elasticity and budgeting,

[K-T Lim] 14:42:48
it is true that we do have to put into place quotas and throttles, so that our users can't all just chew up the entire budget in the first week. And so there need to be controls like that that are imposed in

[K-T Lim] 14:43:05
the services.

        [Enrico Fermi Institute] 14:43:10
        Okay.

        [K-T Lim] 14:43:11
But that would apply on-prem as well; it's also not infinitely adjustable.

        [Dirk Hufnagel] 14:43:13
But on-premise, at least, it's limited:

[Dirk Hufnagel] 14:43:17
you can't run more than what the deployed hardware allows you to.

[Dirk Hufnagel] 14:43:22
Do you have enough data to say how well your controls work?

        [K-T Lim] 14:43:29
So far, again, because most of the access has been by friendly users and staff,

[K-T Lim] 14:43:35
we have not had any problems. Our budget is large enough to cover all the uses that people have been making of it.

[Dirk Hufnagel] 14:43:38
Okay.

[K-T Lim] 14:43:46
But we are tracking, on a weekly basis,

[K-T Lim] 14:43:51
how much we have been spending, and if anything looks like it is getting out of sync, then we can investigate and try to control that. In terms of actually implementing the throttles in the services, the user-facing services,

[K-T Lim] 14:44:10
we only have a couple of them in place right now, and again,

[K-T Lim] 14:44:15
they have not been triggered, so it's hard to say.
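A minimal sketch of the kind of weekly spend check being described; the budget figure and the alert hook are assumptions, and a real deployment would read from the cloud billing export rather than a hard-coded list:

```python
WEEKLY_BUDGET = 10_000          # $ per week, placeholder figure

def check_spend(weekly_totals):
    """Flag the latest week if it busts the budget or doubles the baseline."""
    latest = weekly_totals[-1]
    prior = weekly_totals[:-1]
    baseline = sum(prior) / len(prior) if prior else latest
    if latest > WEEKLY_BUDGET or latest > 2 * baseline:
        # 'notify' is a placeholder for email/chat/ticketing integration.
        notify(f"cloud spend anomaly: ${latest:,.0f} this week "
               f"(baseline ${baseline:,.0f})")

check_spend([6200, 5900, 6400, 14100])   # -> would trigger a notification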

        [Enrico Fermi Institute] 14:44:18
        Yeah.

        [Dirk Hufnagel] 14:44:18
        Okay, thanks.

        [Enrico Fermi Institute] 14:44:22
Okay. Doug?

        [Douglas Benjamin] 14:44:24
Yeah, thanks for the nice talk and the details. Alright, on slide 10

[Douglas Benjamin] 14:44:31
you sort of list the reasons for the use of the cloud services, and also the advantages. It said security,

[Douglas Benjamin] 14:44:41
but that really meant authorization, right? In the sense that you have 2 different user communities that get access.

        [K-T Lim] 14:44:41
        Yeah.

        [Douglas Benjamin] 14:44:52
DOE site resources require a different level of authorization, even within the host lab itself, and what is currently available won't necessarily fit.

[Douglas Benjamin] 14:45:06
Does identity, and the move to federated identity across the US

[Douglas Benjamin] 14:45:13
DOE complex, change that calculation at all for you?

        [K-T Lim] 14:45:19
So there are 2 things there.

        [Douglas Benjamin] 14:45:23
Because I can't imagine the cloud is more secure,

[Douglas Benjamin] 14:45:24
given all the hacks that have happened with cloud vendors and the services on them,

        [Enrico Fermi Institute] 14:45:29
        Okay.

        [Douglas Benjamin] 14:45:32
        than you imply with your statement.

        [K-T Lim] 14:45:35
        Yes and no. I mean, first of all,

        [K-T Lim] 14:45:41
        okay, let me say that at least the cloud vendors... I mean,

        [K-T Lim] 14:45:56
        well, I think they have the ability to be highly secure. Whether their delivered

        [K-T Lim] 14:46:02
        commercial-level services are highly secure or not is another question. But, even though our lab has a great cyber security team, I would still say that the cloud vendors are devoting more resources to security than we are,

        [K-T Lim] 14:46:18
        and so it would seem that it would be at least comparable.

        [K-T Lim] 14:46:23
        That's sort of at an overall, holistic level.

        [K-T Lim] 14:46:26
        In terms of the authorizations: yes. One of the things that we're doing is that the Rubin data sets are open to all astronomers within the US and Chile, as well as selected named astronomers from international

        [K-T Lim] 14:46:52
        partner institutions. And so, even if we had an authentication federation for the US,

        [K-T Lim] 14:47:04
        which at this point we are using InCommon for, and planning to use equivalents in Chile and in Europe,

        [K-T Lim] 14:47:12
        the authorization part of it is still sort of unique to the project. And it also may include international partners who have some difficulties with the US

        [K-T Lim] 14:47:32
        government or with US labs.

        [Douglas Benjamin] 14:47:33
        Yeah, I mean: from the countries that are non grata.

        [K-T Lim] 14:47:36
        Yeah. And sometimes it's not even countries per se,

        [K-T Lim] 14:47:45
        but citizens of those countries who are working at other institutions. So that can be handled better.

        [Douglas Benjamin] 14:47:49
        Right.

        [K-T Lim] 14:47:54
        And then the final thing is that security bullet. Well, I guess I can pull it up again.

        [K-T Lim] 14:48:00
        But

        [K-T Lim] 14:48:04
        Where did it go?

        [K-T Lim] 14:48:11
        Here we go. So I mentioned the limited interfaces. Even if we get hacked, and we kind of expect to have our user-facing services hacked eventually, just because a lot of it, even if it's built on vendor infrastructure, is still our code, and

        [K-T Lim] 14:48:33
        we could still have security problems. So even if that's hacked, the interfaces with the on-prem facilities are limited in terms of what they can do and what they can extract, and what they're extracting is public data, or essentially public data,

        [K-T Lim] 14:48:51
        anyway; it's scoped to data rights holders. And so the exposure to the lab is considerably smaller than if people are actually logging into machines at the lab and have access to the networks in a more general way, etc. Does that make sense?

        [Douglas Benjamin] 14:49:07
        Yeah, it all makes sense. I realize that the science community in Rubin is a little bit different from the LHC experiments, because it's one overarching organization; more people have accounts than don't, on the LHC

        [Douglas Benjamin] 14:49:29
        experiments, for example. The last quick question I had on this

        [K-T Lim] 14:49:31
        Yeah.

        [Douglas Benjamin] 14:49:35
        was: you also implied that storage costs are one of those things that you're worried about eating the budget.

        [Douglas Benjamin] 14:49:40
        So that's why you limited it. You essentially have a lot of the storage,

        [Douglas Benjamin] 14:49:47
        most of it, on-prem.

        [Douglas Benjamin] 14:49:51
        The size of the storage in the cloud seems to be limited relative to what you have on-prem.

        [K-T Lim] 14:49:58
        Right.

        [Douglas Benjamin] 14:50:00
        And is that really... did I get the right sense, that it is sort of cost containment for storage?

        [Enrico Fermi Institute] 14:50:01
        Okay.

        [K-T Lim] 14:50:07
        Yes. I mean, basically, we have hundreds to even thousands of petabytes of data that we are expecting to store by the end of the survey, and the cost of storing all of that for months and months and months in the cloud would be excessive compared to our

        [K-T Lim] 14:50:30
        budget. So we are storing on the order of single-digit percents of that in the cloud, with the rest of it on-prem, in order to fit within the available budget.
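
        (Back-of-the-envelope arithmetic for the cost argument above. The dataset size and the $/GB-month price are assumptions picked for illustration, not Rubin's negotiated rates.)

            # Assumed numbers: a 500 PB dataset, $0.02/GB-month standard object storage.
            DATASET_PB = 500
            USD_PER_GB_MONTH = 0.02
            GB_PER_PB = 1_000_000

            all_in_cloud = DATASET_PB * GB_PER_PB * USD_PER_GB_MONTH  # ~$10M/month
            five_percent = 0.05 * all_in_cloud                        # ~$0.5M/month

            print(f"everything in cloud: ${all_in_cloud:,.0f}/month")
            print(f"5% in cloud:         ${five_percent:,.0f}/month")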

        [Douglas Benjamin] 14:50:50
        Thank you very much

        [Fernando Harald Barreiro Megino] 14:50:56
        Yeah, a simple question: why did you choose Google over other cloud

        [Fernando Harald Barreiro Megino] 14:51:04
        vendors? Was it just the best deal, or anything else?

        [K-T Lim] 14:51:10
        Oh, there are a number of reasons. I forget whether we actually have...

        [K-T Lim] 14:51:15
        I don't think we have a public document that actually states this.

        [K-T Lim] 14:51:23
        Yeah. But I can say, I think, that there were a number of factors. Pricing is only one of them, and, as I mentioned, the ability to work with Google, and their flexibility and ability to work well with our engineering teams, was

        [Enrico Fermi Institute] 14:51:31
        Hmm.

        [K-T Lim] 14:51:58
        actually a fairly major factor as well. The quality of their services,

        [K-T Lim] 14:52:05
        in some cases... I mean, a lot of the things that we use,

        [K-T Lim] 14:52:08
        we try to be vendor-agnostic about, and so the interfaces are all the same.

        [K-T Lim] 14:52:15
        But the performance underlying those interfaces can change from vendor to vendor. And I guess it's something that I've actually kind of complained to Google about sometimes: they often seem to offer a much better product

        [K-T Lim] 14:52:31
        than we need, but at a somewhat more expensive price. Sometimes, though, taking advantage of those improved performance capabilities can actually be useful.

        [Enrico Fermi Institute] 14:52:38
        Thank you.

        [K-T Lim] 14:52:44
        So, for things like... yeah, just as an example: for object store retrieval, even for the coldest archive level of storage, the latencies to actually retrieve the data can be almost the same as for the normal object store on Google Cloud, whereas with

        [K-T Lim] 14:53:03
        Amazon's Glacier there are much greater latencies, most of which we wouldn't mind, but it can be nice to have the capability to just grab something if you need it every once in a while.
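
        (A sketch of the retrieval difference being described, assuming made-up bucket and object names: a GCS Archive-class object can be read with an ordinary download call, while an S3 Glacier-class object must first be restored and only fetched later.)

            from google.cloud import storage
            import boto3

            # Google Cloud Storage: Archive-class objects answer a normal GET.
            gcs = storage.Client()
            gcs.bucket("example-archive-bucket").blob("raw/exp0001.fits") \
               .download_to_filename("exp0001.fits")

            # Amazon S3: a Glacier-class object needs an explicit restore first.
            s3 = boto3.client("s3")
            s3.restore_object(
                Bucket="example-glacier-bucket",
                Key="raw/exp0001.fits",
                RestoreRequest={"Days": 1, "GlacierJobParameters": {"Tier": "Standard"}},
            )
            # ...then poll head_object() until the restore completes (typically
            # hours), and only then issue the usual download_file() call.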

        [K-T Lim] 14:53:15
        So there are a variety of reasons like that.

        [K-T Lim] 14:53:20
        I don't think I can say everything in terms of the vendor comparisons at this time, just because I don't think it's all public.

        [Enrico Fermi Institute] 14:53:29
        So, is Rubin using the subscription model with Google?

        [K-T Lim] 14:53:37
        Subscription model in terms of what? What we have is

        [K-T Lim] 14:53:42
        sort of... it's not a prepaid kind of plan, but

        [K-T Lim] 14:53:49
        there is a commitment to spend a certain amount over a certain amount of time, and we get a number of discounts as a result of that.

        [Enrico Fermi Institute] 14:53:59
        So, have you done that? Has that negotiation happened, like, once?

        [Enrico Fermi Institute] 14:54:05
        Or have you gone back? You know, you committed to doing something for some amount of time, and then you went back and negotiated a new deal. Have you done anything like that? And if you have, have there been significant deltas in the

        [Enrico Fermi Institute] 14:54:21
        pricing, or anything like that?

        [K-T Lim] 14:54:23
        Right. So we have done that once for the interim data facility, which was for a period of 3 years, and that time period is kind of coming up soon.

        [K-T Lim] 14:54:35
        We are in the middle of working on a similar kind of purchase plan for the beginning of operations of the survey, which will start in '24. And that negotiation... I mean, there's some question

        [K-T Lim] 14:54:58
        about whether we can do it as a sole-source agreement; that's being thought about, as well as what the pricing will be.

        [K-T Lim] 14:55:07
        In our discussions so far, we expect that the pricing will be, if anything, lower than what we were getting before, for a number of reasons.

        [K-T Lim] 14:55:17
        We are cognizant of, and worried about, the potential for vendor lock-in.

        [K-T Lim] 14:55:23
        We've had cases with unnamed databases where licenses have radically escalated in price, and with other systems where, similarly, there can be unexpected cost increases.

        [K-T Lim] 14:55:38
        We don't necessarily see that happening here, for a number of reasons.

        [K-T Lim] 14:55:41
        Again, we're trying to use commodity interfaces and services that we could get from another vendor if necessary, or even deploy ourselves on premise if we absolutely had to. And they're also ones that are very commonly used commercially; they're not sort of unique

        [K-T Lim] 14:56:02
        to science in any way, and so there are additional pressures to keep the cost down that way.

        [Enrico Fermi Institute] 14:56:07
        Okay, Thank you.

        [Enrico Fermi Institute] 14:56:08
        Okay, thank you. Doug. Did you have another comment?

        [Douglas Benjamin] 14:56:10
        Yeah, two questions; they're slightly different. One is sort of high-level, on the science platform in the cloud:

        [Douglas Benjamin] 14:56:19
        how long do you envision negotiating your contracts with the vendor

        [Douglas Benjamin] 14:56:25
        for? Is it like a 5-year period, and then you renegotiate after 5?

        [Douglas Benjamin] 14:56:29
        Or is it for the sort of lifetime of the data taking?

        [Douglas Benjamin] 14:56:34
        Because I know the data has to be around for much longer than the actual

        [Enrico Fermi Institute] 14:56:36
        Hello!

        [Douglas Benjamin] 14:56:40
        telescope runtime.

        [K-T Lim] 14:56:41
        Right. So there are 2 things there. One is that the survey is scheduled to run for 10 years; we'll be taking data for 10 years. And, on the project budget,

        [K-T Lim] 14:56:54
        we are committed to providing data products and services to the science users

        [K-T Lim] 14:57:00
        for 11 years after the start of survey operations.

        [K-T Lim] 14:57:04
        So 10 plus... actually, sorry, let me correct that.

        [K-T Lim] 14:57:10
        I think it's been updated to 12 years. So it's 10 years of data taking, one year of processing the 10 years of data,

        [K-T Lim] 14:57:16
        and then one year of delivery of that data release. So that's 12 years.

        [K-T Lim] 14:57:21
        There is a plan for archiving this data

        [K-T Lim] 14:57:28
        and preserving it for an indefinite period

        [K-T Lim] 14:57:35
        after the end of the project. But that plan is not funded by the project;

        [K-T Lim] 14:57:39
        it has to be funded by the NSF separately, and so we don't know exactly how that's going to happen. Of course, it is,

        [K-T Lim] 14:57:44
        you know, 14 years in the future. Now, the negotiation for the purchase of the cloud services will be for a particular term, and it's likely to be something on the order of 3 to 5 years, and then adjusted thereafter. Some of that is on the vendor

        [K-T Lim] 14:58:04
        side, actually; they are not sure how much they want to commit to

        [K-T Lim] 14:58:09
        for a long period of time. But I think it also benefits us, in that we have the opportunity to change if necessary.

        [Douglas Benjamin] 14:58:19
        Okay. And then my orthogonal question was: you mentioned

        [Douglas Benjamin] 14:58:24
        bring-your-own for the science platform. Does that mean bring your own...

        [Enrico Fermi Institute] 14:58:28
        Yeah.

        [K-T Lim] 14:58:28
        Yeah.

        [Douglas Benjamin] 14:58:32
        Since you said Google, and the science platform will be in Google:

        [Douglas Benjamin] 14:58:35
        does that mean bring your own to Google? Or does that mean: I'm a researcher at a university, and my university has made available

        [Douglas Benjamin] 14:58:45
        some compute resources for me to use, and I want to stitch in my university stuff?

        [K-T Lim] 14:58:52
        Yeah. So, stitching the university stuff in with the cloud platform will be a little bit more difficult.

        [K-T Lim] 14:59:02
        We will have interfaces to be able to do sort of bulk

        [K-T Lim] 14:59:05
        downloads of chunks of data that you might want to process locally,

        [K-T Lim] 14:59:10
        but that's expected to be used mostly by certain collaborations that will be working together and using large-scale facilities like NERSC.

        [K-T Lim] 14:59:25
        What I was referring to there specifically was smaller-scale collaborations, or individual investigators, who might say:

        [K-T Lim] 14:59:40
        you know, I want to spend $10,000 on a Google account and purchase compute using that, and then federate that with the science platform, expanding my capabilities using all the same tools that I'm already using, but just

        [K-T Lim] 14:59:59
        now having increased resources to work with, just by having a separate account.

        [Douglas Benjamin] 15:00:05
        Okay, yeah. And so, for the stuff outside of Google's premises, you're really assuming a certain size before it becomes tenable and feasible.

        [Douglas Benjamin] 15:00:20
        What I mean is: there's a certain number of scientists that are bound to collaborate together to, you know, provide a certain amount of resources

        [Douglas Benjamin] 15:00:30
        to do some work; then it's worth your time to bring stuff in.

        [Douglas Benjamin] 15:00:34
        If someone shows up with, you know, 5 computers in their data center, are you still going to do that?

        [Douglas Benjamin] 15:00:42
        You know, there must be a threshold.

        [K-T Lim] 15:00:42
        Yeah. So again, if they have 5 computers, and they want to download data that fits on those 5 computers,

        [K-T Lim] 15:00:52
        and then do whatever they want with it, that's fine,

        [K-T Lim] 15:00:55
        and there's no problem with that. If they have, you know, 10,000 cores, and they want to download 100 petabytes worth of data, then we need to chat with them about how we do that.

        [Douglas Benjamin] 15:01:08
        Okay, thanks.

        [K-T Lim] 15:01:13
        But, I mean, again, the advantage here is really about the flexibility of using any resources that are available in the cloud to work on the same data, because all the data is already there.

        [Douglas Benjamin] 15:01:27
        But it's in the same cloud provider. Okay, thanks.

        [K-T Lim] 15:01:30
        In the same cloud, yes.

        [Enrico Fermi Institute] 15:01:33
        Okay.

        [K-T Lim] 15:01:34
        Yeah, I was looking at Snowmobile;

        [K-T Lim] 15:01:40
        it's very interesting, but I think Snowmobile is only usable to bring data into AWS, as far as I could tell; exporting

        [K-T Lim] 15:01:54
        it is not so easy.

        [Lindsey Gray] 15:01:57
        Yeah, it's a truck. You should be able to go

        [Lindsey Gray] 15:02:01
        both directions.

        [K-T Lim] 15:02:02
        Yeah, they have an E Ink label on it, though, that automatically changes to point to the local Amazon data center.

        [Enrico Fermi Institute] 15:02:13
        Okay. Other questions or comments for KT?

        [Enrico Fermi Institute] 15:02:22
        Okay, if not: KT, thank you very much. This was really informative, and I think everybody learned a lot from this.

        [K-T Lim] 15:02:27
        Thank you. I think there will also be talks, maybe, on very similar

        [K-T Lim] 15:02:32
        topics at the ATLAS software and computing workshop, I think in October, and they're even talking about

        [K-T Lim] 15:02:44
        maybe one of us going to CERN in the spring for similar kinds of conversations.

        [K-T Lim] 15:02:49
        So perhaps we can also talk then. Anyway, thank you very much for inviting me.

        [Enrico Fermi Institute] 15:02:52
        Great. Yeah, Thank you.

      • 13:20
        Policy, Follow-up 20m

        What features/policies would help CMS/ATLAS adopt cloud/HPC resources?
        - Greatly reduced egress fees?
        - Ease of access to allocations (vs competitive proposals)

        (Eastern Time)

         

        Additional Topics and Discussion

         

        [Enrico Fermi Institute] 15:02:58
        Yeah. So, we have a few more things left in the agenda; let me share the screen again.

        [Enrico Fermi Institute] 15:03:13
        This was the charge question that was posted before we had KT's presentation.

        [Enrico Fermi Institute] 15:03:20
        The next thing was, you know, just as we wrap up the workshop: were there

        [Enrico Fermi Institute] 15:03:28
        any other topics, any other observations that folks wanted to bring up here?

        [Enrico Fermi Institute] 15:03:35
        If there's anything that we haven't covered that we should have covered, and folks want to bring that up now, it's a great time.

        [Paolo Calafiura (he)] 15:03:47
        Maybe I missed it; did we discuss the next steps?

        [Enrico Fermi Institute] 15:03:51
        Yeah, I have a slide for that, actually.

        [Paolo Calafiura (he)] 15:03:54
        Good.

        [Enrico Fermi Institute] 15:03:56
        Lindsey, you had something?

        [Lindsey Gray] 15:03:57
        Yeah, since Dale's still here, and I forgot to ask this morning:

        [Lindsey Gray] 15:04:03
        going back to what you were talking about yesterday,

        [Lindsey Gray] 15:04:08
        I wanted to understand the process for setting up a layer 2 or layer 3 circuit, and whether that just requires having, you know, an appropriate certificate from an authority and can be largely automated, or does someone actually have to physically do things

        [Lindsey Gray] 15:04:27
        at some point whenever you're making every single circuit?

        [Lindsey Gray] 15:04:32
        How automated can that whole process be?

        [Dale Carder] 15:04:35
        Yeah. On the ESnet side, for layer 2 it can be fairly automated.

        [Enrico Fermi Institute] 15:04:40
        Okay.

        [Dale Carder] 15:04:41
        I forget offhand how it works, but that's sort of tried and true. For layer 3,

        [Dale Carder] 15:04:46
        that's to be defined. Typically, there are still other constraints

        [Dale Carder] 15:04:53
        people might want to consider, such as: is there adequate capacity on the physical infrastructure in which these services are virtualized?

        [Dale Carder] 15:05:00
        So there still ends up being, you know, a period of consultation before it gets to the point of making dynamic

        [Dale Carder] 15:05:08
        API calls.
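
        (To make the "dynamic API calls" step concrete: a hypothetical certificate-authenticated provisioning request. The endpoint, payload, and field names are invented for illustration; they are not ESnet's actual interface.)

            import requests

            resp = requests.post(
                "https://provisioning.example.net/api/circuits",  # made-up endpoint
                cert=("/etc/grid-security/hostcert.pem",          # client certificate
                      "/etc/grid-security/hostkey.pem"),          # and key for authn
                json={
                    "layer": 2,
                    "endpoints": ["site-A:eth1", "site-B:eth7"],
                    "bandwidth_mbps": 10000,
                    "duration_hours": 48,
                },
                timeout=30,
            )
            resp.raise_for_status()
            print("circuit id:", resp.json().get("id"))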

        [Lindsey Gray] 15:05:10
        Okay. Yeah, this will be a bit of a longer thought process,

        [Lindsey Gray] 15:05:15
        but I'm just wondering exactly how much of that can really be automated.

        [Lindsey Gray] 15:05:20
        Because even in terms of trying to understand what capacity you have at a site, or between 2 sites, that's something that you could also figure out how to negotiate mostly automatically, if you had some orchestration service dealing with both sites.

        [Lindsey Gray] 15:05:37
        So, huh. All right, I'll chew on that some more and probably send you an email.

        [Dale Carder] 15:05:43
        Yeah. Also, I think it's in the Snowmass report, in the network section;

        [Dale Carder] 15:05:47
        I forget, it was like Area 4. There's some more stuff around that.

        [Lindsey Gray] 15:05:48
        Okay.

        [Lindsey Gray] 15:05:54
        Alright great thanks.

        [Enrico Fermi Institute] 15:05:57
        Andy, you had a comment?

        [abh] 15:06:00
        Yeah, there was a slide way back, I mean way back earlier today, about what other services are needed to make cloud access

        [abh] 15:06:09
        more efficient, better, whatever; and I wasn't able to attend that particular session because I had a conflicting meeting.

        [abh] 15:06:18
        So I did want to mention a project we're working on that is called the S3

        [abh] 15:06:23
        Gateway. Its big thing is to be able to transition to the cloud easily, by essentially allowing the current HEP security infrastructure, whether it be certificates or tokens, to be used with any cloud provider, because the S3 Gateway basically translates all of

        [abh] 15:06:46
        those credentials, applies the security that's necessary, and then does whatever you need to do in terms of uploading and downloading to the cloud storage.

        [abh] 15:06:54
        So that is, in my mind, a great way to bridge the gap before we actually have to deal with the actual cloud provider security,

        [abh] 15:07:07
        so that people can start using this stuff relatively easily with their existing credentials.
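
        (To illustrate the idea: standard S3 tooling is simply pointed at the gateway instead of a cloud endpoint, and the gateway maps the presented credential onto the real cloud storage. The endpoint URL and key names below are placeholders, not the gateway's actual interface.)

            import boto3

            s3 = boto3.client(
                "s3",
                endpoint_url="https://s3gw.example.org:1094",  # made-up gateway URL
                aws_access_key_id="MAPPED-KEY",        # stands in for the HEP credential
                aws_secret_access_key="MAPPED-SECRET", # which the gateway translates
            )
            s3.upload_file("hits.root", "example-bucket", "user/hits.root")
            s3.download_file("example-bucket", "user/hits.root", "hits_copy.root")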

        [Enrico Fermi Institute] 15:07:16
        Interesting. So I had a quick question about that, Andy: does this only work in front of, you know, a cloud provider?

        [Enrico Fermi Institute] 15:07:24
        Or would that be something that would work in front of MinIO, for instance? ... Okay, awesome.

        [abh] 15:07:28
        It definitely works in front of MinIO.

        [Enrico Fermi Institute] 15:07:34
        So does this help at all with the cloud certificate-authority type of problems?

        [Enrico Fermi Institute] 15:07:38
        Or is this

        [abh] 15:07:40
        Not exactly. I mean, it solves the problem perfectly for storage,

        [abh] 15:07:46
        right? It doesn't quite solve the problem for compute access; you need something else for that.

        [abh] 15:07:54
        And I would assume PanDA could handle that, but then the PanDA team would have to say something on that.

        [Enrico Fermi Institute] 15:07:54
        Yeah.

        [Enrico Fermi Institute] 15:08:03
        But the compute is a bit easier, though, right? Because, presumably, the experiments control the environment that they run in, so they can add whatever CA authorities they want.

        [Enrico Fermi Institute] 15:08:14
        For instance, if you run the OSG regular grid

        [Enrico Fermi Institute] 15:08:17
        CA cert bundle, or whatever, that has, for instance, the InCommon...

        [Enrico Fermi Institute] 15:08:23
        I'm sorry, not InCommon, but the Let's Encrypt

        [Enrico Fermi Institute] 15:08:26
        CA trusted, right? So that's a little bit easier.

        [abh] 15:08:26
        Right, yeah. But, I mean, again, it really depends on how that's integrated.

        [Enrico Fermi Institute] 15:08:28
        There.

        [abh] 15:08:34
        I don't know the specifics on the compute side, and how that's done,

        [abh] 15:08:38
        so I can't really address that. But, presumably, you know, PanDA and equivalent systems can provide the same kind of access.
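
        (A minimal sketch of the earlier point about experiments controlling the compute environment: the pilot or job wrapper points the standard trust-store variables at a CA bundle the experiment ships with the job. The paths are illustrative.)

            import os, subprocess

            # Grid tools honor X509_CERT_DIR; OpenSSL- and Python-based tools
            # honor SSL_CERT_FILE / REQUESTS_CA_BUNDLE. Paths are examples only.
            os.environ["X509_CERT_DIR"] = "/cvmfs/grid.cern.ch/etc/grid-security/certificates"
            os.environ["SSL_CERT_FILE"] = "/srv/job/ca-bundle.crt"
            os.environ["REQUESTS_CA_BUNDLE"] = "/srv/job/ca-bundle.crt"

            # The payload inherits the trust setup from this environment.
            subprocess.run(["./run_payload.sh"], check=True)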

        [Fernando Harald Barreiro Megino] 15:08:49
        Okay. For the integration with the cloud, what we do is we talk directly with the Google, let's say, API. And we have

        [Fernando Harald Barreiro Megino] 15:09:02
        the service account JSON on the harvester machine, and that's it.

        [Fernando Harald Barreiro Megino] 15:09:10
        Anything else.
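
        (What "a service account JSON on the harvester machine" typically looks like in code, sketched with the google-auth and google-api-python-client libraries; the key path and project name are placeholders.)

            from google.oauth2 import service_account
            from googleapiclient import discovery

            creds = service_account.Credentials.from_service_account_file(
                "/opt/harvester/etc/gcp-key.json",  # placeholder key path
                scopes=["https://www.googleapis.com/auth/cloud-platform"],
            )
            compute = discovery.build("compute", "v1", credentials=creds)
            result = compute.instances().list(
                project="example-project", zone="us-central1-a").execute()
            print(result.get("items", []))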

        [Enrico Fermi Institute] 15:09:16
        Okay.

        [Enrico Fermi Institute] 15:09:20
        Fernando, Philip, did you guys have other comments, or was that it?

        [Fernando Harald Barreiro Megino] 15:09:25
        No, I'm good.

        [Enrico Fermi Institute] 15:09:28
        Okay, yeah. Doug?

        [Douglas Benjamin] 15:09:31
        Yeah, one thing I was wondering about is if

        [Douglas Benjamin] 15:09:37
        Right now, at least for ATLAS, the HPCs and the cloud instances are what I'll call standalone, in the sense that we are not flowing work;

        [Douglas Benjamin] 15:09:54
        we're not using either of them as an extension of our existing resources, sort of a blending of on-prem and off-prem in the same, let's say, batch system, for example. Is that something we actually should be investigating for the future?

        [Enrico Fermi Institute] 15:10:17
        Are you suggesting, like, taking a Tier

        [Enrico Fermi Institute] 15:10:19
        2 and adding cloud resources to it? Okay.

        [Douglas Benjamin] 15:10:23
        Something like that. I mean, I know Dirk mentioned that they were sort of looking at that. And maybe it's not a Tier

        [Douglas Benjamin] 15:10:29
        2; maybe it's in the analysis space.

        [Douglas Benjamin] 15:10:33
        Like what was shown with Rubin, though they're not doing on-prem/off-prem blending,

        [Enrico Fermi Institute] 15:10:37
        See.

        [Douglas Benjamin] 15:10:42
        in that they separated the science part off to the cloud.

        [Enrico Fermi Institute] 15:10:45
        I mean, personally, I think that in the analysis space the cloud makes a lot of sense, right?

        [Enrico Fermi Institute] 15:10:50
        Because you can get a lot of resources, any kind of resource you want,

        [Enrico Fermi Institute] 15:10:54
        and you can get it very quickly, if you pay for it, obviously. But...

        [Enrico Fermi Institute] 15:11:05
        I don't know if anybody else had other comments on that

        [Enrico Fermi Institute] 15:11:13
        Okay. Other comments in general, questions, other things we missed?

        [Enrico Fermi Institute] 15:11:23
        One meta-question is: how well did this format work for people?

        [Douglas Benjamin] 15:11:28
        Well, I have a question. There was a policy item; did that get... will that be talked about?

        [Enrico Fermi Institute] 15:11:30
        Yes.

        [Douglas Benjamin] 15:11:36
        Or is it that you just have the one line in the

        [Douglas Benjamin] 15:11:40
        Indico that says policy features?

        [Enrico Fermi Institute] 15:11:43
        Yeah, I mean, I think that was actually covered in the morning, right?

        [Enrico Fermi Institute] 15:11:49
        And that was just: what can we...

        [Enrico Fermi Institute] 15:11:53
        You know, what are the policies that we could ask for that would make our lives easier?

        [Douglas Benjamin] 15:12:01
        Because some of the policies for the HPCs can be very disruptive, and I know we didn't talk about the incentives for the LCFs.

        [Douglas Benjamin] 15:12:11
        Right? If we can work with DOE ultimately, and that's for their biggest machines, to really

        [Douglas Benjamin] 15:12:19
        rethink the policies and the incentives that their minders

        [Douglas Benjamin] 15:12:30
        give them, that might make these resources better for us,

        [Enrico Fermi Institute] 15:12:36
        Yeah, yes, right?

        [Douglas Benjamin] 15:12:37
        assuming we can run on them, blah blah blah. By that I mean, as a very specific example: Argonne ALCF schedules the biggest jobs faster than the smaller jobs. And yet, you know, we were able, with ATLAS, to run across the whole machine, but we had to

        [Enrico Fermi Institute] 15:12:59
        Okay.

        [Douglas Benjamin] 15:13:01
        chop things up a little bit, which is a little different, you know, due to different incentives.

        [Douglas Benjamin] 15:13:10
        And that's where we need help, going up through DOE, to talk from one office to the other.

        [Enrico Fermi Institute] 15:13:17
        Right.

        [Enrico Fermi Institute] 15:13:21
        Yeah. So how does that get translated into something that we put in the report? Right? Like, do we just say...

        [Douglas Benjamin] 15:13:30
        You do it as a finding, and then maybe a recommendation,

        [Douglas Benjamin] 15:13:35
        if it's strong enough.

        [Enrico Fermi Institute] 15:13:36
        Yeah.

        [Douglas Benjamin] 15:13:38
        That's how it gets translated into the report.

        [Dale Carder] 15:13:38
        Having served on the review committees for, like, some of the LCFs, it could be very interesting for you guys to read the annual reports

        [Dale Carder] 15:13:49
        they write. For example, I'm sure ALCF has at least their 2021 report published online. See the terminology they use, and how they describe,

        [Dale Carder] 15:14:00
        you know, essentially the incentive structure they're given, and how they match jobs to the charge

        [Dale Carder] 15:14:06
        they're given. Phrasing things in those terms would certainly help.

        [Enrico Fermi Institute] 15:14:17
        Yeah. This seems related to the comment that was made, I think, on Monday, about:

        [Enrico Fermi Institute] 15:14:22
        you know, we have to kind of sell it

        [Enrico Fermi Institute] 15:14:27
        in their terms.

        [Dale Carder] 15:14:28
        Yeah. And, aside from just the difference between HTC

        [Dale Carder] 15:14:34
        and HPC, there are still sort of differences in language, you know, about how these workflows exist, what resources they need, and why it should be a national imperative to make sure that

        [Dale Carder] 15:14:49
        the resources are allocated for your use case.

        [Enrico Fermi Institute] 15:15:02
        Other comments there?

        [Enrico Fermi Institute] 15:15:08
        Other comments.

        [Douglas Benjamin] 15:15:08
        Yeah, I think we should phrase the language right, and the reason I say that is because I had a conversation during some activity I was doing; that's where it came up.

        [Enrico Fermi Institute] 15:15:20
        Okay.

        [Douglas Benjamin] 15:15:20
        If we can give the LCFs some cover, and work with them to change the policy...

        [Douglas Benjamin] 15:15:29
        They really do want to have their machines be usable by a large cross-section of the sciences.

        [Douglas Benjamin] 15:15:35
        But they have the incentives they're being judged against;

        [Douglas Benjamin] 15:15:43
        until that changes, they can't do things that make the machines much more usable for us.

        [Douglas Benjamin] 15:15:48
        So that's where we have to help them help us.

        [Enrico Fermi Institute] 15:15:50
        Sure.

        [Douglas Benjamin] 15:15:51
        A report like this might be able to do that, if it goes far enough up the food chain.

        [Enrico Fermi Institute] 15:16:03
        Okay, thanks.

        [Enrico Fermi Institute] 15:16:07
        Any final thoughts?

        [Enrico Fermi Institute] 15:16:17
        Okay. Maybe I'll move on to the last slide here.

        [Enrico Fermi Institute] 15:16:20
        So, yeah, what happens next? First of all, we really appreciate everybody kind of slogging it out and doing this for 3 days, as a mix of mostly remote and some folks coming in person.

        [Enrico Fermi Institute] 15:16:32
        We really appreciate everybody's input. I think at this point our next steps would be:

        [Enrico Fermi Institute] 15:16:39
        we'll follow up with some folks after the workshop; I'm sure we'll want to pick your brains a bit

        [Enrico Fermi Institute] 15:16:48
        more about specific areas of the report that we're writing.

        [Enrico Fermi Institute] 15:16:52
        We would really appreciate it if you keep sending your feedback, keep sending us your thoughts on these topics.

        [Enrico Fermi Institute] 15:17:01
        I put a kind of tentative date here for a draft.

        [Enrico Fermi Institute] 15:17:06
        We're planning to have a draft out by November first, and then our report will be finished by December

        [Enrico Fermi Institute] 15:17:15
        first. So thank you. Did any of the other folks have comments?

        [Enrico Fermi Institute] 15:17:27
        Dirk? No?

        [Dirk Hufnagel] 15:17:33
        No, I think I just want to maybe add thanks to everyone that participated in the discussion; we are very appreciative of that. We got some good discussion going. Now we just have to go through the notes, summarize, get a report out, and, if there's more input, follow up

        [Enrico Fermi Institute] 15:17:50
        Yeah. If folks want to help write pieces, you know, I think that would be okay.

        [Dirk Hufnagel] 15:17:51
        with some people.

        [Dirk Hufnagel] 15:17:59
        There are still some hands up, so maybe there are some last-minute comments.

        [Enrico Fermi Institute] 15:18:00
        Yeah, indeed. Eric.

        [Eric Lancon] 15:18:03
        Yes, I wanted to congratulate you two for organizing this workshop.

        [Eric Lancon] 15:18:09
        When I saw the agenda, I thought it would be mission impossible, but you did it.

        [Eric Lancon] 15:18:18
        And it was very interesting, very fruitful discussions. Congratulations!

        [Eric Lancon] 15:18:24
        Thank you very much

        [Enrico Fermi Institute] 15:18:25
        Thank you. Tony.

        [Tony Wong] 15:18:28
        So, you talked about "keep sending us your feedback." Where should that feedback be sent?

        [Tony Wong] 15:18:33
        Just send it to your own personal email? Is that how it works?

        [Enrico Fermi Institute] 15:18:36
        Yeah, you know, I thought about that when I wrote that bullet point in the slide. I guess I should have put the email addresses on there, or a mailing list or something. Maybe we'll follow up

        [Enrico Fermi Institute] 15:18:48
        with another mail, you know, with how folks can keep contributing, and how we can keep the conversation going.

        [Enrico Fermi Institute] 15:19:01
        Other comments.

        [Paolo Calafiura (he)] 15:19:05
        Just want to echo what Eric said. Great job, guys.

        [Enrico Fermi Institute] 15:19:10
        Hey, thank you very much. Okay, well, thanks, everybody. Yeah, we'll follow up.

        [Enrico Fermi Institute] 15:19:16
        Appreciate your time.

        [Lindsey Gray] 15:19:17
        Thank you.

        [Dirk Hufnagel] 15:19:19
        Hey, thanks. Bye.

        [Enrico Fermi Institute] 15:19:22
        Bye, everyone.

      • 13:40
        Discussion 1h 20m