[Eastern Time]
More R&D and Discussion
1321
15:39:28,910 --> 15:39:30,250
Enrico Fermi Institute: Um.
1322
15:39:30,800 --> 15:39:36,420
Enrico Fermi Institute: So this kind of leads into the next section. We wanted to talk a little bit about
1323
15:39:36,490 --> 15:39:38,910
Enrico Fermi Institute: R&D efforts.
1324
15:39:41,730 --> 15:39:44,150
Enrico Fermi Institute: Now we've covered some of this already.
1325
15:39:46,490 --> 15:39:50,170
Enrico Fermi Institute: Um, Dirk, did you want to say a couple of things about this?
1326
15:39:50,440 --> 15:40:03,390
Dirk: Yeah. This comes directly from a question that's in the charge, where basically they ask us: is there anything we can do on the R&D side, or that
1327
15:40:03,670 --> 15:40:05,530
Dirk: that is needed to
1328
15:40:05,900 --> 15:40:09,369
Dirk: expand
1329
15:40:09,590 --> 15:40:23,570
Dirk: the range of what we can do on commercial cloud and HPC, or increase the cost-effectiveness, which kind of go hand in hand. And, uh, we already talked a little bit about LCF integration and the HPC focus area, that there's
1330
15:40:23,640 --> 15:40:27,459
Dirk: work to be done on the GPU workloads, which is
1331
15:40:27,810 --> 15:40:35,630
Dirk: somewhat out of scope for this workshop, because we're not supposed to talk about framework and software development.
1332
15:40:35,680 --> 15:40:52,100
Dirk: Um, but then there's also integration work. We talked a little bit about this on the cost side: at this point, uh, estimating LCF long-term operations cost is a bit hard, because the integration is not fully worked out.
1333
15:40:52,170 --> 15:41:01,009
Dirk: Um, software delivery — during the HPC focus area everyone kind of agreed that it's different everywhere,
1334
15:41:01,020 --> 15:41:12,510
Dirk: and then there's edge services, where also every HPC seems to do their own thing in what they support. They all want to support it, but they kind of have different solutions in place,
1335
15:41:12,540 --> 15:41:15,390
Dirk: and it's also to me at least a bit unclear
1336
15:41:15,420 --> 15:41:20,420
Dirk: what the long-term operational needs are in this area.
1337
15:41:20,900 --> 15:41:28,610
Dirk: And then we already talked a little bit about dynamic cloud use, uh, which means basically you do your whole —
1338
15:41:28,750 --> 15:41:44,449
Dirk: the whole processing chain inside the cloud. Uh, Fernando talked about that a little bit: to reduce egress charges, you basically copy in your input data once, and then do multiple processing runs on it and
1339
15:41:44,460 --> 15:41:56,950
Dirk: only keep the end result, basically, and forget about the intermediate output. Then you save: you don't have to get it all out, you only have to egress the smaller final output. We already talked about machine learning.
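[The egress saving described here can be put into rough numbers. A minimal sketch of the arithmetic — the price and data volumes below are purely illustrative placeholders, not figures from the discussion:]

```python
# Rough egress-cost comparison for keeping the whole processing chain
# inside the cloud. All numbers are illustrative, not real provider prices.

EGRESS_PER_TB = 90.0  # $/TB, hypothetical egress price

def egress_cost(intermediate_tb, final_tb, runs, keep_intermediates):
    """Cost of moving data out of the cloud after `runs` processing passes.

    If intermediates stay in the cloud and only the final output is
    egressed, the per-run intermediate volume never leaves the provider.
    """
    out_tb = final_tb if not keep_intermediates else runs * intermediate_tb + final_tb
    return out_tb * EGRESS_PER_TB

# Copy input once, run 5 passes, egress only the small final output:
cheap = egress_cost(intermediate_tb=10.0, final_tb=0.5, runs=5, keep_intermediates=False)
# Same workload, but pulling every intermediate out of the cloud:
costly = egress_cost(intermediate_tb=10.0, final_tb=0.5, runs=5, keep_intermediates=True)
print(cheap, costly)  # → 45.0 4545.0
```

[With these made-up numbers the difference is two orders of magnitude, which is the point being made: only the final output should cross the egress boundary.]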
1340
15:41:58,040 --> 15:41:59,560
Dirk: And then uh,
1341
15:42:01,030 --> 15:42:20,909
Dirk: there's, uh, R&D work on different architectures, to be able to support those, uh, which opens up possibilities in both HPC and cloud use: uh, FPGAs, um, various GPU types. That feeds into the GPU workloads, but it's not exclusive to just, uh,
1342
15:42:21,080 --> 15:42:32,130
Dirk: GPU workloads, because it could also be machine learning: how do we integrate machine learning to make use of these new architectures? And that's gonna be
1343
15:42:32,750 --> 15:42:35,820
Dirk: integration R&D, but also basic, uh,
1344
15:42:35,910 --> 15:42:41,970
Dirk: basic R&D on some of these topics. And then there were some
1345
15:42:42,710 --> 15:42:50,129
Dirk: things that we're kind of playing around with that are unique to the cloud, where they offer platforms that
1346
15:42:50,460 --> 15:43:07,240
Dirk: are kind of hard to replicate in-house. Uh, like, there's BigQuery, BigTable experiments, functions as a service. I don't know too much about it; we just threw it on here. Maybe Lindsay or Mike could say something about that, or someone else who's more familiar with it.
1347
15:43:10,780 --> 15:43:17,670
Paolo Calafiura (he): I won't say that I'm familiar with functions as a service, but I just want to mention that this is also
1348
15:43:17,690 --> 15:43:30,329
Paolo Calafiura (he): um, an important area for HPCs too. They are developing, they are developing functions as a service, probably with the same framework, the funcX framework. Yes,
1349
15:43:30,340 --> 15:43:48,699
Paolo Calafiura (he): and, uh, there is apparently a solution for [unclear] to the main LCFs of funcX, using something called [unclear]. So this is something we are very interested in at the CCE as a possible joint project across the —
1350
15:43:48,710 --> 15:44:05,420
Enrico Fermi Institute: So, I guess from personal experience: uh, we actually quite routinely use Parsl, uh, for farming out analysis jobs. Uh, and at some point back in the day there was a proof of concept
1351
15:44:05,430 --> 15:44:11,389
Enrico Fermi Institute: using a funcX endpoint and doing analysis jobs with that.
1352
15:44:11,420 --> 15:44:36,939
Enrico Fermi Institute: Um, so all of the groundwork for that has actually been laid out. Um, and we could return to using that; we just ended up using Dask a little bit more prevalently. But it's also something that we left up to the user at the end of the day, and if we want to develop more infrastructure around that, uh, we have a basis to start from.
1353
15:44:36,950 --> 15:44:53,969
Enrico Fermi Institute: Uh, as far as going to production workflows, or reconstruction, or something like that, I don't think that's been explored at all. Um, but it looked really promising and interesting from the analysis view of things.
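[The pattern being described — register an analysis function, farm dataset chunks out as independent tasks, gather the futures, and combine partial results — is the futures-style interface Parsl and funcX expose. A minimal stdlib-only sketch under that assumption; the `analyze` function and the toy chunks are hypothetical stand-ins, and a real setup would submit through a Parsl executor or a funcX endpoint instead of a local thread pool:]

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def analyze(chunk):
    """Hypothetical per-chunk analysis task; here it just sums the chunk."""
    return sum(chunk)

def run_analysis(chunks, max_workers=4):
    """Farm chunks out as independent tasks, gather futures, combine results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(analyze, c) for c in chunks]
        return sum(f.result() for f in as_completed(futures))

if __name__ == "__main__":
    # Three toy "dataset chunks" standing in for real analysis inputs.
    print(run_analysis([[1, 2, 3], [4, 5], [6]]))  # → 21
```

[The point of the futures interface is that the same submit/gather code can start on a laptop and scale out by swapping the executor, which is what makes the laptop-to-cloud story discussed below plausible.]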
1354
15:44:53,980 --> 15:45:07,179
Enrico Fermi Institute: And I think at the time it was just a little bit immature compared to where things have gone more recently. For BigQuery and BigTable, I think this is actually —
1355
15:45:07,960 --> 15:45:21,790
Enrico Fermi Institute: uh, right. This was studied by Gordon Watts and company, and they did a couple of benchmarks of what the performance per dollar was for analysis-like queries on
1356
15:45:21,830 --> 15:45:26,670
Enrico Fermi Institute: data sets backed by various engines,
1357
15:45:27,330 --> 15:45:44,309
Enrico Fermi Institute: and we could go and take a look at that paper. But the gist of it was that BigQuery and BigTable are, uh, not nearly as cost-efficient as, uh, using RDataFrame, for instance, or coffea, um, or, well, Awkward Array plus uproot, for instance.
1358
15:45:44,320 --> 15:46:01,499
Enrico Fermi Institute: So there's already some demonstration that, while these offerings are there, they're not quite up to the performance that we can already provide with our home-grown tools. But maybe this also provides a, uh, way to talk with the bigger cloud services and say, hey,
1359
15:46:01,510 --> 15:46:06,510
Enrico Fermi Institute: this is the kind of performance we need — can we do any performance matching here?
1360
15:46:08,310 --> 15:46:15,509
Dirk: What — sorry, that was a bit of an information dump. — No, it's fine. But the thing is, this is all —
1361
15:46:16,260 --> 15:46:27,480
Dirk: the question — one basic question I had about this is: while some of these areas that are being worked on can provide quite a great
1362
15:46:27,500 --> 15:46:34,020
Dirk: improvement in user experience, like at the analysis level, you just —
1363
15:46:34,090 --> 15:46:40,670
Dirk: yeah, to what extent are they applicable if you look at, like, a global picture of
1364
15:46:40,730 --> 15:46:48,980
Dirk: experiment resource use? I mean, because the individual user experience doesn't necessarily mean you save a lot of
1365
15:46:48,990 --> 15:47:02,780
Dirk: resources overall; but you can make life easier for your users, and you improve the physics output, and that's all great. It's just, um, in terms of looking at that application of —
1366
15:47:02,950 --> 15:47:09,230
Dirk: of money — is this a large enough area that we have to —
1367
15:47:10,220 --> 15:47:23,680
Enrico Fermi Institute: how prominently should we put it into the report? Basically, that's what I'm trying to get at.
1368
15:47:23,690 --> 15:47:37,939
Enrico Fermi Institute: As you make things more scalable, so that folks can, you know, do the first exploratory bits of their analysis from their laptop, and then scale that seamlessly into the cloud with funcX, or whatever does that —
1369
15:47:38,120 --> 15:47:55,869
Enrico Fermi Institute: um, if you can make it so that those first exploratory steps are at less scale, then of course that means that the resource usage, as you scale up more and more, is going to be much more uniform between all the users that you
1370
15:47:55,880 --> 15:48:08,630
Enrico Fermi Institute: have engaging with the system, which means you can probably schedule it all a little bit better, which I think is another way of saying, you know, you just make things nicer for the users. Um,
1371
15:48:08,640 --> 15:48:28,579
Enrico Fermi Institute: but, one, it means that, uh, figuring out how to schedule all that becomes easier, which means it becomes, uh, more efficient from the operational perspective, I would say. And then, uh, it also changes the way in which people
1372
15:48:28,590 --> 15:48:51,250
Enrico Fermi Institute: compete for resources at clusters, because all the analyses start looking more and more the same. Um, and they also start reaching the larger resources at a higher level of maturity than perhaps what you see even nowadays — sometimes people just run stuff and see what happens, and it's very, very experimental software, let's say.
1373
15:48:51,260 --> 15:48:54,349
Enrico Fermi Institute: Um. So,
1374
15:48:54,520 --> 15:49:00,139
Enrico Fermi Institute: to answer your question of like, is this big enough to care?
1375
15:49:00,760 --> 15:49:15,249
Enrico Fermi Institute: I have a feeling that right now it is big enough to care, and the fact that we're getting more data is going to keep it in the regime of being big enough to care — and to report on, and make sure that we actually treat this,
1376
15:49:15,260 --> 15:49:40,909
Enrico Fermi Institute: at least in a special way, because the resource usage pattern is wildly different from production. Um, but as we roll out these things like functions as a service, or, uh, figure out how to scale columnar analysis and RDataFrame effectively, uh, it's going to — yeah, it's going to make the usage of resources less bursty and easier to manage, which is kind of good for us.
1377
15:49:40,920 --> 15:49:53,019
Enrico Fermi Institute: But also, uh, it's not going to become a bigger piece of the competition for all the computing resources. So that's what it sort of looks like in my mind, kind of extrapolating from what we have right now.
1378
15:49:53,070 --> 15:50:12,099
Enrico Fermi Institute: Uh, I think the answer, then, is: uh, we need to watch it and see what these systems that are just starting to come online actually do for resource usage, uh, even if it's not at scale, and see if it does bring kind of this evening-out of competition for resources at Tier-2s,
1379
15:50:12,110 --> 15:50:15,289
Enrico Fermi Institute: um, and otherwise make the analysis —
1380
15:50:15,620 --> 15:50:21,180
Enrico Fermi Institute: analysis computing usage a bit more even, as far as —
1381
15:50:21,370 --> 15:50:25,670
Enrico Fermi Institute: sorry — even as far as job submission goes, and things like that.
1382
15:50:25,860 --> 15:50:29,870
Enrico Fermi Institute: That's sort of my view. Of course,
1383
15:50:30,000 --> 15:50:38,340
Enrico Fermi Institute: yeah, this is trying to predict the future. So, other people, please feel free to predict the future too, and we can see what works.
1384
15:50:39,280 --> 15:50:57,220
Paolo Calafiura (he): Always very informative to hear from you. Uh, I'm certainly not nearly as competent, and I know there are more competent people on the call who may want to chime in. But, uh, our interest from the CCE side —
1385
15:51:05,270 --> 15:51:24,750
Paolo Calafiura (he): [unclear] complex enough that — and by the way, Dirk, yesterday we heard that CMS, uh, is, um, sort of fighting against the provisioning challenges, you know, creating workers with the right —
1386
15:51:24,760 --> 15:51:28,160
Paolo Calafiura (he): uh, with the right capabilities.
1387
15:51:28,170 --> 15:51:50,549
Paolo Calafiura (he): Uh, you know, to some extent — I don't know which, since I'm incompetent there — these issues have been addressed by the folks who developed Parsl. So some of those issues, uh, have made ATLAS think that Parsl could be a good back end for some of our existing code, in this sort of —
1388
15:51:50,560 --> 15:51:56,159
Paolo Calafiura (he): and I'm hoping that somebody more competent jumps in.
1389
15:51:57,290 --> 15:52:13,480
Enrico Fermi Institute: Um, the only thing that I can tack on to that is that, uh, Anna and company back in the day, uh, figured out how to make a backfilling system, uh, using funcX and Parsl. So that's definitely something that works.
1390
15:52:13,530 --> 15:52:29,769
Enrico Fermi Institute: Um, and that's also what the guys at Nebraska are doing with the Coffea-Casa analysis facility, as they're backfilling into the production jobs. So for sure, this is a pattern that works and that people can implement. But,
1391
15:52:29,780 --> 15:52:34,630
Enrico Fermi Institute: uh, we also don't know how it scales out, uh,
1392
15:52:34,750 --> 15:52:43,950
Enrico Fermi Institute: you know, to more and more data and more and more users. The usage right now, I would say, is fairly limited. And, yeah, that's —
1393
15:52:45,020 --> 15:52:50,759
Enrico Fermi Institute: I think that adds some context. But we definitely need to hear from more people on this.
1394
15:52:51,470 --> 15:52:59,310
Dirk: Hey, maybe just one comment there: we're primarily interested in production here. But, on the other hand, analysis takes over
1395
15:52:59,610 --> 15:53:06,270
Dirk: half our resources — or half the T2s, at least — so there's a significant fraction. So if analysis gets easier,
1396
15:53:06,690 --> 15:53:13,279
Dirk: that means maybe there's more resources for production to use. — Just as a quick correction, it's only a quarter, Dirk.
1397
15:53:13,390 --> 15:53:18,340
Dirk: Oh, it's a quarter of it? I thought it's half the T2s. Now it's a quarter?
1398
15:53:18,530 --> 15:53:20,280
Dirk: It's a quarter now. Okay.
1399
15:53:20,350 --> 15:53:28,460
Enrico Fermi Institute: Yeah, as more production just shows up, the fraction gets smaller and smaller.
1400
15:53:33,200 --> 15:53:46,199
Enrico Fermi Institute: But, yeah, I mean, just thinking about it more: there's also this rather severe impedance mismatch, at least right now, with the cadence of analysis jobs versus, uh, production jobs,
1401
15:53:46,210 --> 15:53:55,879
Enrico Fermi Institute: since analysis is much more bursty and short-lived, as opposed to a production job that comes in and, you know, is going to use twenty-four hours in a slot, or something like that.
1402
15:53:56,180 --> 15:54:02,060
Enrico Fermi Institute: So, by its very nature, it's a much more adaptive
1403
15:54:02,510 --> 15:54:06,890
Enrico Fermi Institute: and reactive scheduling problem.
1404
15:54:20,280 --> 15:54:28,630
Enrico Fermi Institute: So, one of the things that we mentioned with the cloud offerings — I mean, we had a couple of examples: there are BigQuery, BigTable, functions as a service.
1405
15:54:28,650 --> 15:54:47,950
Enrico Fermi Institute: One of the questions I had, at least, was: is there anything I'm missing, right, like on the cloud? Because if you go and look at the service catalog for something like AWS, it has this humongous, you know, spread of services that they offer. Uh, is there anything that we're
1406
15:54:47,990 --> 15:54:49,940
Enrico Fermi Institute: leaving on the table that we should
1407
15:54:50,600 --> 15:54:51,950
Enrico Fermi Institute: look into?
1408
15:54:55,200 --> 15:54:59,800
Enrico Fermi Institute: Uh, I'll say that something that's interesting —
1409
15:55:00,150 --> 15:55:18,890
Enrico Fermi Institute: maybe not just for, uh, clouds, but also for sort of on-premises facilities — is, uh, things like SONIC that let us sort of, um, disaggregate the GPUs and the CPUs. So if you're doing inference, you might not need a whole GPU. But,
1410
15:55:18,900 --> 15:55:27,490
Enrico Fermi Institute: you know — let's say in the cloud case, let's just stick with that — you might be buying a bunch of GPU nodes,
1411
15:55:27,500 --> 15:55:39,980
Enrico Fermi Institute: uh, which are many times more expensive. But, you know, if the reconstruction path only needs a quarter of a GPU, being able to independently scale the number of GPUs and CPUs that you're running at a time, um,
1412
15:55:39,990 --> 15:55:51,770
Enrico Fermi Institute: is something useful. And then, like I mentioned, for on-premises stuff too, because you can stick either two or four of these GPUs into a box; but if the core count is two hundred and fifty-six on the node, then
1413
15:55:52,010 --> 15:55:54,990
Enrico Fermi Institute: you better hope that the
1414
15:55:55,060 --> 15:56:01,679
Enrico Fermi Institute: fraction of time that you're spending on the GPU, and the speedup that you get — you know, Amdahl's law and all that — actually makes it worthwhile.
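[The back-of-the-envelope check invoked here (Amdahl's law) can be made explicit. A sketch with hypothetical numbers — the GPU-accelerated fraction and speedup factor below are placeholders, not measurements from the discussion:]

```python
def amdahl_speedup(gpu_fraction, gpu_speedup):
    """Overall speedup when only `gpu_fraction` of the runtime is
    accelerated by a factor `gpu_speedup` (Amdahl's law)."""
    return 1.0 / ((1.0 - gpu_fraction) + gpu_fraction / gpu_speedup)

def jobs_per_gpu(gpu_fraction, gpu_busy_share=1.0):
    """Concurrent CPU jobs one GPU can serve if each job keeps the GPU
    busy only `gpu_fraction` of the time (perfect packing assumed)."""
    return gpu_busy_share / gpu_fraction

# Hypothetical job: 10% of runtime on inference, 20x faster on a GPU.
print(round(amdahl_speedup(0.10, 20.0), 2))  # → 1.1  (modest overall gain)
print(jobs_per_gpu(0.10))                    # → 10.0 (jobs sharing one GPU)
```

[This is the disaggregation argument in miniature: a job that is only 10% GPU work gains little from a dedicated GPU, but roughly ten such jobs can share one — hence the interest in scaling GPUs and CPUs independently.]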
1416
15:56:19,070 --> 15:56:38,129
Enrico Fermi Institute: Yes, and going on to that: there already is, and there is going to be, an ever-growing class of analysis user that is asking for GPUs too, and you have to again deal with this very different rate of scheduling resources for them.
1417
15:56:38,430 --> 15:56:55,730
Enrico Fermi Institute: Um, and sometimes the amount of — or at least the burstiness of — the data processing that they're trying to do on that GPU is much, much higher compared to, like, a production job, even if the total resources are much higher on the production side, just because of job multiplicity.
1418
15:56:55,740 --> 15:57:19,540
Enrico Fermi Institute: You have users that are, you know, just poking around doing their exploratory stuff, and right now we give them a whole T4. Well, a T4 per hour is not cheap, not cheap at all. And you'll have people, like, training models and then loading them onto a T4, running their whole signal data set, or something like that, to see what it looks like in the tails, et cetera, et cetera, or running it on their backgrounds.
1419
15:57:19,580 --> 15:57:24,290
Enrico Fermi Institute: And it's still the same problem of needing to
1420
15:57:24,450 --> 15:57:42,980
Enrico Fermi Institute: uh, very piecemeal, uh, schedule your GPUs, and then on top of that schedule all the networking between them, because you have this really insane burst of, uh, inference requests for a very short amount of time that you need to negotiate on your network to not mess with everyone else's jobs.
1421
15:57:43,170 --> 15:57:44,580
Enrico Fermi Institute: So
1422
15:57:44,620 --> 15:57:54,399
Enrico Fermi Institute: It might not be a huge — what you said, it's a quarter of the Tier-2s right now; let's say it just stays a quarter of that. But the
1423
15:57:54,590 --> 15:58:09,069
Enrico Fermi Institute: way that it's going to be using the resources, if it's that bursty, may not look like a quarter at certain points in time during the analysis workflow, and that's something we have to be ready to deal with.
1424
15:58:09,370 --> 15:58:13,230
Enrico Fermi Institute: I have no idea how to actually schedule that.
1425
15:58:13,490 --> 15:58:14,539
Mhm
1426
15:58:19,200 --> 15:58:23,320
Enrico Fermi Institute: So, we're almost at the top of the hour.
1427
15:58:23,800 --> 15:58:28,420
Enrico Fermi Institute: So, any other topics that we wanted to hit before we wrap up for the day?
1428
15:58:41,590 --> 15:58:47,809
Enrico Fermi Institute: So I think, logistically, tomorrow we were going to talk a little bit about —
1429
15:58:49,090 --> 15:58:54,949
Enrico Fermi Institute: let's see — in the morning, I think, we were going to talk about accounting and pledging.
1430
15:58:55,240 --> 15:58:57,530
Enrico Fermi Institute: We're going to talk about some, you know.
1431
15:58:57,840 --> 15:59:14,780
Enrico Fermi Institute: facility features, policies; have a discussion about security topics when it comes to HPC and cloud. Um, yeah, allocations, you know, planning, that sort of thing. Then, I think, in the afternoon,
1432
15:59:14,790 --> 15:59:18,350
Enrico Fermi Institute: we'll have a presentation
1433
15:59:18,520 --> 15:59:22,869
Enrico Fermi Institute: from the Vera Rubin folks to talk about their experiences.
1434
15:59:23,700 --> 15:59:42,449
Enrico Fermi Institute: And then, yeah, some summary-type work, and just, you know, other topics or observations that people would like to bring up. So, I mean, if there's something that we haven't hit on the agenda that people would really like to talk about, um, tomorrow afternoon would be a really good time to bring that up.
1435
15:59:47,150 --> 15:59:49,349
Enrico Fermi Institute: Anything else from anyone?
1436
15:59:55,150 --> 16:00:00,209
Enrico Fermi Institute: Okay, sounds like not. All right, thanks, everybody. We'll talk to you tomorrow.
1437
16:00:01,790 --> 16:00:03,559
Fernando Harald Barreiro Megino: Bye. Thank you.