[Eastern time]

 

929
14:37:45,550 --> 14:37:49,740
Dirk: So, figure out how to use this. So we were at impact, right?

930
14:37:49,770 --> 14:37:50,810
Enrico Fermi Institute: Yeah,

931
14:37:50,880 --> 14:37:56,380
Dirk: I can say a little bit about that. And I think we discussed some of that yesterday already,

932
14:37:56,630 --> 14:38:01,189
Dirk: but it's also including cloud now. So we're looking at both. Um.

933
14:38:01,450 --> 14:38:19,450
Dirk: So what happens if we actually start using a lot of HPC and cloud, uh, how the integration... I mean, at the moment we run them opportunistically, so they are considered an add-on. But if we ever get to a point where they're like a large fraction of our overall resources:

934
14:38:19,660 --> 14:38:37,420
Dirk: What's the impact on our global computing infrastructure? And how does it impact the owned resources that are still in the mix? So basically, you can look at that, and you basically would have a lot of compute external to our own resources in some way.

935
14:38:37,470 --> 14:38:43,410
Dirk: And then you look at: what does that mean for our own sites? What kind of changes

936
14:38:44,440 --> 14:39:02,069
Dirk: might potentially be needed there to facilitate large-scale cloud. And to a large degree that will depend on how much we are actually using the storage at the cloud or the HPC. So if you consider that you don't have any storage, and you have to stream, or some other way to get the data

937
14:39:02,110 --> 14:39:09,129
Dirk: in and out quickly and just process it on demand, that puts more pressure on our own sites. Well, versus

938
14:39:09,150 --> 14:39:22,100
Dirk: if you look at what ATLAS does: they have a self-contained site. That follows more the model of just bringing up another site somewhere else on some external resources. But it's kind of mostly a self-contained site.

939
14:39:22,530 --> 14:39:23,539
Dirk: Um!

940
14:39:23,700 --> 14:39:42,489
Dirk: The other impact is that if you want to... if we decide tomorrow, for instance, that, um, our code performs great on ARM, and we should, uh, switch to it as much as possible because it's more cost effective, you can actually do that much quicker on the cloud. Like, for instance, for that Google site: in principle

941
14:39:43,130 --> 14:39:59,530
Dirk: they could decide tomorrow that, oh, from now on we're providing ARM CPUs and not Intel CPUs anymore, um, because you change the instance type. Um, you can't do that on our own resources; that's a much longer process of multiple years to swap out resources.

942
14:39:59,560 --> 14:40:01,730
Dirk: And uh, yeah,

943
14:40:01,830 --> 14:40:21,730
Dirk: And the other obvious issue is, even if we get storage at the, uh, cloud and HPC sites, you have to, uh, worry about transfers, because all these resources need to be integrated in our transfer infrastructure. We need to, uh, have Rucio be able to connect somehow. Uh, maybe, uh,

944
14:40:22,480 --> 14:40:23,449
Dirk: have

945
14:40:24,520 --> 14:40:35,929
Dirk: it. It mentions in intermediary node services. I know B. And L. Has some global online endpoint that atlas to facilitate transfers to some Hbc. And things like that, so

946
14:40:36,540 --> 14:40:52,250
Dirk: that feeds directly into the last point: network integration. So it's not just the transfer services, but also the underlying transfer fabric, the network connectivity of the cloud sites and HPC resources.

947
14:40:57,530 --> 14:41:07,960
Dirk: As I said, we discussed some of it yesterday already. And, uh, the one comment was that we should break out hardware and service costs, that are basically...

948
14:41:08,930 --> 14:41:11,800
Dirk: So, anything else? Any other comments on this?

949
14:41:17,770 --> 14:41:20,830
Enrico Fermi Institute: One of the things that we had talked about in our,

950
14:41:21,430 --> 14:41:33,769
Enrico Fermi Institute: you know, just discussions among the blueprint group, uh, you know, before the workshop here, was: is there any impact on

951
14:41:34,040 --> 14:41:46,960
Enrico Fermi Institute: on grid sites, if we were to, you know, do something like shift, you know, large amounts of certain kinds of workflows to cloud? Like if we did a lot of,

952
14:41:46,980 --> 14:41:53,320
Enrico Fermi Institute: you know, a lot more simulation on HPC, would we have to, you know,

953
14:41:53,540 --> 14:42:01,009
Enrico Fermi Institute: with the Tier-2s, run correspondingly more analysis or something like that? If that were the case, would they have to

954
14:42:01,330 --> 14:42:04,189
Enrico Fermi Institute: up their facilities in certain ways,

955
14:42:05,200 --> 14:42:12,550
Enrico Fermi Institute: Or does that not make sense at all? Should we just anticipate that we'll be able to run all workload types on all resources,

956
14:42:13,990 --> 14:42:15,060
things like that?

957
14:42:16,390 --> 14:42:19,609
Enrico Fermi Institute: I see there's a hand raised from Eric.

958
14:42:34,640 --> 14:42:37,089
Eric Lancon: to export um

959
14:42:37,220 --> 14:42:39,349
Eric Lancon: the CPU processing

960
14:43:17,160 --> 14:43:18,999
Eric Lancon: at the same site.

961
14:43:23,900 --> 14:43:40,379
Dirk: Yeah, that's something we worried about, because of the impact on the data transfers, for Fermilab specifically. Because if you look at how we designed HEPCloud, where we basically treat the HPC as an external compute resource, then most of the

962
14:43:40,540 --> 14:43:53,389
Dirk: the I/O and the data actually goes through Fermilab. So far everything is holding up nicely, but eventually, as we scale up HPC use, there's probably going to be an impact on

963
14:43:53,480 --> 14:43:58,259
Dirk: on provisioning of network and storage at Fermilab.

964
14:44:22,190 --> 14:44:23,250
Um.

965
14:44:23,340 --> 14:44:25,430
Enrico Fermi Institute: Other comments on

966
14:44:25,660 --> 14:44:30,349
Enrico Fermi Institute: the impact of HPC and cloud use on the existing infrastructure?

967
14:44:37,250 --> 14:44:39,300
Steven Timm: I'll just say something I heard

968
14:44:39,420 --> 14:44:42,249
Steven Timm: you might not think about.

969
14:44:42,480 --> 14:44:43,560
Steven Timm: Uh.

970
14:44:43,730 --> 14:44:48,970
Steven Timm: This was not a CMS thing, this was... but we were running a very

971
14:44:49,080 --> 14:44:58,119
Steven Timm: um... and then, running with the Google Cloud for an inference server, we managed to saturate the network for a short time between us and Google.

972
14:44:59,330 --> 14:45:02,110
Steven Timm: So uh, you can.

973
14:45:02,280 --> 14:45:06,529
Steven Timm: If you're doing inference, you have to be careful of your, um...

974
14:45:17,080 --> 14:45:29,849
Enrico Fermi Institute: I have what is possibly a profoundly uninformed question: how much of our Monte Carlo generation, at the actual generator level, is being

975
14:45:29,860 --> 14:45:38,420
Enrico Fermi Institute: done, uh... or, well, what is, uh, taking place on GPUs? Like using GPUs to do the Monte Carlo integration and unweighting,

976
14:45:40,460 --> 14:45:56,779
Enrico Fermi Institute: because that is a significant fraction of time that we spend right now. I mean, what's that in, uh, ATLAS and CMS? Zero? Because, uh, I mean, a very quick search on the Internet informs us that, uh,

977
14:45:56,790 --> 14:46:15,759
Enrico Fermi Institute: using GPUs for Monte Carlo integration has been around for more than ten years now, and the speedup for that integration is like a factor of fifty or something. Uh, though of course this probably depends on the shape of the thing that you're integrating, and how many poles it has and whatnot.

978
14:46:15,870 --> 14:46:24,899
Enrico Fermi Institute: But has anyone looked at benchmarking that? And could it have a major impact if we could significantly reduce the

979
14:46:25,020 --> 14:46:28,389
Enrico Fermi Institute: the time to integrating

980
14:46:28,420 --> 14:46:37,380
Enrico Fermi Institute: time to getting an integrated cross-section, and then also the time to unweighting the, whatever, necessary amounts of events?

981
14:46:37,460 --> 14:46:48,019
Enrico Fermi Institute: And could that fit on the HPC resources better? Could we use that in any way? I'm not sure... after that this goes really open-ended, but it seems like it's something we're not considering,
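
[Editor's aside: the unweighting step mentioned here is the usual accept-reject procedure, where weighted generator events are kept with probability w / w_max so the survivors all carry unit weight. A minimal sketch under that assumption — illustrative only, not any experiment's actual generator code, and the exponential weight distribution is a hypothetical stand-in:]

```python
import numpy as np

def unweight(weights, rng):
    """Accept-reject unweighting: keep each weighted event with
    probability w / w_max, so surviving events have unit weight."""
    w_max = weights.max()
    return rng.random(weights.size) < weights / w_max

rng = np.random.default_rng(1)
# Hypothetical event weights; a real generator's weights come from
# the matrix element and the phase-space sampling.
w = rng.exponential(scale=1.0, size=100_000)
keep = unweight(w, rng)
# Fraction of events surviving; a low value means many raw events
# must be generated per unweighted event.
efficiency = keep.mean()
```

[The efficiency is roughly mean(w) / max(w), which is why a peaked integrand with poorly matched sampling makes this step expensive.]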

982
14:46:48,150 --> 14:46:54,739
Enrico Fermi Institute: because it would be a really nice way to hide a lot of the latency in our production workloads right now,

983
14:46:55,120 --> 14:46:57,210
Enrico Fermi Institute: or get rid of it, not even hide it.

984
14:47:00,660 --> 14:47:13,370
Enrico Fermi Institute: Uh... yeah, this was a really open-ended question. But have we looked at that? And, uh, if we're not doing it now, after ten years, there must be something wrong

985
14:47:13,420 --> 14:47:20,730
Dirk: Maybe... Lindsay, you and Mike should be in the best position to be able to answer that question, in terms of

986
14:47:21,190 --> 14:47:25,329
Enrico Fermi Institute: for something that is that old.

987
14:47:25,360 --> 14:47:31,520
Enrico Fermi Institute: There's either something wrong with it, or we've actually just not been paying attention to it for a decade.

988
14:47:31,530 --> 14:47:46,870
Enrico Fermi Institute: Um. So... yeah, I personally don't have any information on that. Mike, do you have anything? I think the answer is zero as well, you know. So why aren't we using this? That's kind of a weird one.

989
14:47:47,170 --> 14:47:49,719
Steven Timm: There have been studies recently that almost

990
14:47:49,890 --> 14:48:01,270
Steven Timm: the dominant part of generation is actually throwing the dates and rolling around numbers. But I don't know if that's true for him as a but I know it's through for doing so. I mean, could you envision a situation where you're

991
14:48:01,280 --> 14:48:16,659
Enrico Fermi Institute: it's just throwing random numbers for you and nothing else? Uh, yeah, I mean, that's probably what a large portion of it is, that they're throwing lots of random numbers in parallel. Um, they have very good, uh, RNGs, uh, for GPUs.
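
[Editor's aside: the "lots of random numbers in parallel" point can be sketched as a vectorized Monte Carlo integral, where one batched RNG call and one batched integrand evaluation do essentially all the work. This is an illustration with NumPy on CPU, not production generator code; on a GPU the same pattern would typically use a counter-based generator, and the peaked toy integrand below is hypothetical:]

```python
import numpy as np

def mc_integrate(f, n_samples, seed=0):
    """Vectorized Monte Carlo estimate of the integral of f over [0, 1],
    returning the estimate and its statistical uncertainty."""
    rng = np.random.default_rng(seed)
    x = rng.random(n_samples)  # one big batch of uniform random numbers
    fx = f(x)                  # evaluate the integrand on the whole batch
    return fx.mean(), fx.std(ddof=1) / np.sqrt(n_samples)

# Toy integrand with a narrow peak, standing in for a matrix element
# with a pole; its exact integral over [0, 1] is 20 * arctan(5).
def peaked(x):
    return 1.0 / ((x - 0.5) ** 2 + 0.01)

estimate, error = mc_integrate(peaked, 1_000_000)
```

[Flat sampling of a peaked integrand gives a large per-sample variance; importance sampling around the pole is where real generator-level integrators spend their effort, and that sampling loop is what parallelizes well.]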

992
14:48:17,070 --> 14:48:35,200
Dirk: I think the question also goes a little bit out of scope, because we're not supposed to look into what's going on on the framework side and the software side. But maybe... I mean, from the conversation I had with Matti on a lot of the... we had this effort to spend in terms of GPU,

993
14:48:35,210 --> 14:48:39,789
Dirk: I think the simple answer is: we looked at the full chain,

994
14:48:40,240 --> 14:48:51,369
Dirk: GEN-SIM, DIGI-RECO, plus whatever miscellaneous comes after. And then they decided that generation is not the primary target of

995
14:48:51,730 --> 14:49:02,819
Dirk: a porting effort, because it's not overall that important for us; it's less important than reconstruction and tracking. I mean, it's just not the lowest-hanging fruit,

996
14:49:03,210 --> 14:49:16,139
Dirk: and the picture changes, of course, from generator to generator. But I think that's the simple answer: the effort was focused on certain areas, and that's one of the ones that wasn't focused on.

997
14:49:16,150 --> 14:49:34,669
Enrico Fermi Institute: Yeah, I can see that. That's a reasonable answer, I guess. Uh, looking at kind of the shape of the compute facilities that we are getting from HPC: uh, packaging up some huge job that you, uh, you know, send out to an HPC, and then wait a long time and get your answer back. Uh, it seems,

998
14:49:34,680 --> 14:49:46,849
Enrico Fermi Institute: at least in terms of, like, the geometry or the topology of the... that makes a lot more sense for the kind of resources we're talking about. But I understand that reco is certainly a higher priority in terms of compute, that

999
14:49:49,940 --> 14:49:53,110
Enrico Fermi Institute: that's sort of where my thinking is heading, that's all.

1000
14:49:57,580 --> 14:49:59,379
Enrico Fermi Institute: Steve, did you have another comment?

1001
14:50:00,440 --> 14:50:01,420
Steven Timm: No

1002
14:50:09,860 --> 14:50:13,999
Enrico Fermi Institute: Other comments here? Or should we move on to network integration?

1003
14:50:21,620 --> 14:50:24,489
Enrico Fermi Institute: Okay. Sounds like we should move on.