[Eastern Time]
14:00:29,710 --> 14:00:34,679
Enrico Fermi Institute: I think this is the last session that's focused exclusively on cloud.
616
14:00:36,900 --> 14:00:37,920
Yeah.
617
14:00:38,670 --> 14:00:44,219
Enrico Fermi Institute: In the next session we'll talk about some R&D things and networking. So
618
14:00:52,660 --> 14:00:57,720
Enrico Fermi Institute: okay, so maybe we break here, and we'll, uh, see everybody at one o'clock Central Time,
619
14:00:58,540 --> 14:01:00,130
Fernando Harald Barreiro Megino: so you know
620
14:01:01,310 --> 14:01:02,620
Enrico Fermi Institute: machine learning,
621
14:01:03,820 --> 14:01:09,699
Enrico Fermi Institute: and then we'll go back to the topics as presented in the slides,
622
14:01:10,610 --> 14:01:12,850
Enrico Fermi Institute: so we'll just get started in a few minutes here,
623
14:01:53,380 --> 14:01:56,370
Maria Girone: So it's Eric starting first, right?
624
14:01:56,540 --> 14:02:04,520
Maria Girone: Yeah, if Eric is ready to present, we thought maybe it would be best to just have them
625
14:02:07,990 --> 14:02:16,680
Enrico Fermi Institute: It's getting a little bit late, I'm concerned. Yeah, exactly. We want to be considerate of people's time, in Europe especially. Thank you.
626
14:02:29,590 --> 14:02:39,579
Enrico Fermi Institute: So just give it like two more minutes, and then, um, Eric, whenever you're ready, you know, put your slides up. I'll stop sharing here when we get started shortly.
627
14:02:42,740 --> 14:02:47,450
Eric Wulff: Sounds good. I'm ready whenever, so just let me know. Okay,
630
14:03:09,350 --> 14:03:17,650
Enrico Fermi Institute: It seems like the rate at which people are rejoining has slowed down significantly, so I think you can go ahead and start.
631
14:03:22,080 --> 14:03:23,529
Eric Wulff: uh, Okay.
632
14:03:24,610 --> 14:03:25,870
Eric Wulff: So
633
14:03:27,290 --> 14:03:31,050
Eric Wulff: I'm sharing now, I think. Can you see?
634
14:03:31,340 --> 14:03:33,999
Eric Wulff: Yes, it looks good. Okay, great.
635
14:03:34,560 --> 14:03:37,929
Eric Wulff: Um. So I just have a
636
14:03:38,180 --> 14:03:52,689
Eric Wulff: two or three slides here. So it's a very short presentation, just to talk a little bit about what we have been doing regarding distributed training and hypertuning of deep-learning-based algorithms using HPC computing.
637
14:03:53,360 --> 14:04:00,499
Eric Wulff: So this is something that I have been doing in the context of an EU-funded research project called CoE RAISE.
638
14:04:06,260 --> 14:04:08,620
Eric Wulff: Maria is also involved in this, and she's my supervisor.
639
14:04:09,580 --> 14:04:10,969
Um.
640
14:04:12,850 --> 14:04:15,450
So let's see if I can change slide.
641
14:04:15,770 --> 14:04:17,940
Eric Wulff: Yes, um.
642
14:04:18,590 --> 14:04:24,429
Eric Wulff: So, just in case you're not aware of hyperparameter optimization, um,
643
14:04:25,320 --> 14:04:35,079
Eric Wulff: if you're not aware of what that is, I've tried to summarize it very quickly here in just one slide. I will sometimes refer to it as hypertuning,
644
14:04:35,140 --> 14:04:36,670
Eric Wulff: and um,
645
14:04:36,730 --> 14:04:39,300
Eric Wulff: it's basically to um
646
14:04:39,340 --> 14:04:49,350
Eric Wulff: tune the hyperparameters of an AI model or a deep learning model, and hyperparameters are simply the settings of the model that aren't learned during training. Um,
647
14:04:58,840 --> 14:05:09,139
Eric Wulff: um, and they can define things like the model architecture. So, for instance, how many layers you have in your neural network, how many nodes you have in each layer, and so on.
648
14:05:09,520 --> 14:05:19,239
Eric Wulff: Um, but they also define things that have to do with the optimization of the model, such as the learning rate, the batch size, and so forth.
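The two families of hyperparameters described above (architecture: layers and nodes per layer; optimization: learning rate and batch size) could be written down as a search space roughly like this. A minimal sketch: the names and ranges are invented for illustration, not taken from the MLPF setup.

```python
import random

# Illustrative search space mixing architecture hyperparameters (number of
# layers, nodes per layer) with optimization hyperparameters (learning rate,
# batch size). Names and ranges are made up for this sketch.
def sample_config(rng):
    return {
        "num_layers": rng.randint(2, 8),
        "nodes_per_layer": rng.choice([64, 128, 256, 512]),
        "learning_rate": 10 ** rng.uniform(-5, -2),   # log-uniform draw
        "batch_size": rng.choice([16, 32, 64, 128]),
    }

# Each trial in a hypertuning run would train the model with one such draw.
config = sample_config(random.Random(0))
```

Frameworks like Ray Tune provide their own sampling primitives for exactly this kind of space; the point here is only that a hyperparameter configuration is just a dictionary of settings the training code reads but never learns.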
649
14:05:19,720 --> 14:05:20,570
Yeah.
650
14:05:22,180 --> 14:05:28,950
Eric Wulff: Now, if you have a large model, or a very complex model, which requires a lot of compute
651
14:05:29,220 --> 14:05:30,469
Eric Wulff: and
652
14:05:31,480 --> 14:05:33,510
Eric Wulff: uh, to do the forward pass,
653
14:05:33,610 --> 14:05:34,950
Eric Wulff: and
654
14:05:35,630 --> 14:05:38,329
Eric Wulff: and/or you have a large dataset,
655
14:05:38,360 --> 14:05:41,660
Eric Wulff: um, hypertuning can be extremely
656
14:05:41,940 --> 14:05:56,630
Eric Wulff: compute- and resource-intensive. So it can therefore benefit greatly from HPC resources. And furthermore, we need smart and efficient search algorithms to find good hyperparameters, so that we don't waste the HPC resources that we have.
657
14:05:59,290 --> 14:06:00,480
Eric Wulff: um.
658
14:06:01,000 --> 14:06:10,500
Eric Wulff: So in RAISE, I have been working with a group working on machine-learned particle flow, which is
659
14:06:10,810 --> 14:06:13,939
Eric Wulff: uh, in collaboration with CMS,
660
14:06:14,080 --> 14:06:17,230
Eric Wulff: with people from CMS. Um, and
661
14:06:17,420 --> 14:06:19,599
Eric Wulff: in order to hypertune this model,
662
14:06:19,690 --> 14:06:25,310
Eric Wulff: um, in RAISE we have been using an open-source framework called Ray Tune,
663
14:06:25,750 --> 14:06:34,059
Eric Wulff: uh, which allows us to run many different trials in parallel, using multiple GPUs per trial,
664
14:06:34,270 --> 14:06:39,010
Eric Wulff: uh, which is what this picture up here is trying to represent.
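The setup just described (many trials launched in parallel, each training with its own hyperparameter configuration, and the best one picked afterwards) can be sketched in plain Python without Ray Tune itself. A hedged sketch: `train_trial` is a made-up stand-in for a real training run, and this is not the actual Ray Tune API.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def train_trial(config):
    """Stand-in for one full training run; returns a fake validation loss.
    In the real setup each such trial could itself use several GPUs."""
    rng = random.Random(config["seed"])
    return abs(rng.gauss(0, 0.1)) + config["learning_rate"]

def run_trials(configs, max_parallel=4):
    """Run all trials in parallel and report the best (lowest-loss) one."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        losses = list(pool.map(train_trial, configs))
    best = min(range(len(configs)), key=losses.__getitem__)
    return {"config": configs[best], "loss": losses[best]}

# Eight hypothetical trials with different learning rates.
configs = [{"seed": i, "learning_rate": 10 ** random.Random(i).uniform(-5, -2)}
           for i in range(8)]
best = run_trials(configs)
```

Ray Tune additionally handles GPU placement, logging, and the per-trial overview mentioned next; this sketch only shows the embarrassingly parallel core of the idea.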
665
14:06:39,570 --> 14:06:40,990
Eric Wulff: And
666
14:06:42,990 --> 14:06:51,389
Eric Wulff: Now, with Ray Tune we can also get a very nice overview of the different trials, and we can pick the one that we see performs the best,
667
14:06:51,580 --> 14:06:57,289
Eric Wulff: and Ray Tune also has a lot of different search algorithms that
668
14:06:57,660 --> 14:07:01,359
Eric Wulff: help us find the right, uh,
669
14:07:01,690 --> 14:07:02,970
hyperparameters.
670
14:07:03,430 --> 14:07:18,949
Eric Wulff: And here, to the right, we have an example of the kind of difference this can make to the learning of the model. So here we have plotted the training and validation losses before and after hypertuning.
671
14:07:20,620 --> 14:07:32,120
Eric Wulff: So as you can see here, the loss went down quite a bit after hypertuning, almost by a factor of two, and furthermore the training seems to be much more stable. We have
672
14:07:32,380 --> 14:07:36,559
Eric Wulff: these bands, which represent the standard deviation
673
14:07:36,750 --> 14:07:42,170
Eric Wulff: between different trainings. It's much more stable in the right plot.
674
14:07:47,030 --> 14:07:56,090
Eric Wulff: Um, and I just had one more slide here to sort of illustrate how useful high-performance computing can be in order to speed up
675
14:07:56,810 --> 14:07:58,380
hyperparameter optimization.
676
14:07:58,560 --> 14:08:03,430
Eric Wulff: Uh, so this just shows the scaling from four to twenty-four
677
14:08:03,680 --> 14:08:05,309
Eric Wulff: compute nodes.
678
14:08:05,330 --> 14:08:06,550
Eric Wulff: Um,
679
14:08:06,990 --> 14:08:15,439
Eric Wulff: Maybe particularly looking at the plot to the right here, we can see that the scaling for this use case is actually better than linear,
680
14:08:15,570 --> 14:08:20,269
Eric Wulff: um, which at least in part has to do with, uh
681
14:08:20,820 --> 14:08:26,109
Eric Wulff: some excessive reloading of models that happens when we have few nodes.
682
14:08:28,060 --> 14:08:29,150
Eric Wulff: Um.
683
14:08:31,070 --> 14:08:35,830
Eric Wulff: So, um, well, this basically means that the more
684
14:08:36,030 --> 14:08:41,099
Eric Wulff: nodes we have, the more GPUs we have, and the faster we can tune and deploy these models.
685
14:08:41,670 --> 14:08:47,480
Eric Wulff: That's all I had for this.
686
14:08:48,740 --> 14:08:58,029
Enrico Fermi Institute: Can you tell a priori that the model you're using will
687
14:08:58,080 --> 14:09:04,340
Enrico Fermi Institute: show this behavior, so that if someone comes with any given model, you know how to sort of shape the work,
688
14:09:06,550 --> 14:09:15,609
Enrico Fermi Institute: if you understand what I mean. No? What I mean is, you discovered that you get better-than-linear scaling with this training,
689
14:09:15,700 --> 14:09:16,719
Right?
690
14:09:17,160 --> 14:09:22,499
Enrico Fermi Institute: That's not always the case, right? Or is that the case with any given model?
691
14:09:23,150 --> 14:09:24,459
Um,
692
14:09:25,150 --> 14:09:33,199
Eric Wulff: Yeah, I think it would be. So, this is showing the scaling of the hyperparameter optimization itself.
693
14:09:33,650 --> 14:09:40,180
Eric Wulff: Um, so it's not that. If you had just a single training, it wouldn't scale like this. It would be
694
14:09:40,360 --> 14:09:42,610
Eric Wulff: a bit worse than linear, probably.
695
14:09:45,610 --> 14:09:51,289
Eric Wulff: But so, the way that the hypertuning works in this case is that we
696
14:09:51,430 --> 14:09:53,199
Eric Wulff: launch a bunch of
697
14:09:53,690 --> 14:09:56,980
Eric Wulff: trials in parallel with different hyperparameter
698
14:09:57,010 --> 14:09:58,559
Eric Wulff: configurations.
699
14:09:58,990 --> 14:10:00,189
Eric Wulff: And then
700
14:10:00,340 --> 14:10:01,780
Eric Wulff: um!
701
14:10:02,230 --> 14:10:10,820
Eric Wulff: There is a sort of scheduling or search algorithm looking at how well all these trials perform,
702
14:10:10,940 --> 14:10:22,829
Eric Wulff: and then it terminates the ones that look less promising and continues training the ones that look promising. And then we can also have some kind of Bayesian optimization
703
14:10:23,190 --> 14:10:26,360
Eric Wulff: component here, which tries to predict which
704
14:10:27,470 --> 14:10:31,230
Eric Wulff: hyperparameters would perform well, and then we try those next.
705
14:10:32,930 --> 14:10:39,059
Enrico Fermi Institute: And if you were to double or triple the number of nodes, would you continue to see that? Does the
706
14:10:39,310 --> 14:10:42,929
Enrico Fermi Institute: does the actual growth begin to flatten out?
707
14:10:43,430 --> 14:11:00,910
Eric Wulff: Um, I haven't tested this with more than up to twenty-four nodes, so I can't say for sure, but I imagine it will continue for at least a bit more. But I can't say for how long, and
708
14:11:01,060 --> 14:11:16,039
Enrico Fermi Institute: I also expect that eventually it would flatten off.
709
14:11:17,080 --> 14:11:18,540
Eric Wulff: Um,
710
14:11:19,510 --> 14:11:23,909
Enrico Fermi Institute: Yeah. Is the issue resource contention?
711
14:11:24,600 --> 14:11:30,520
Eric Wulff: Yeah, it has to do with this search algorithm, which
712
14:11:30,630 --> 14:11:32,309
Eric Wulff: um
713
14:11:33,180 --> 14:11:39,990
Eric Wulff: trains a few trials and then terminates bad ones, and then continues with new ones. So
714
14:11:40,360 --> 14:11:48,789
Eric Wulff: if you have more trials than you have nodes to run them on, uh, you have to sort of
715
14:11:49,280 --> 14:11:54,179
Eric Wulff: pause trials at some point and start training other ones.
716
14:11:54,590 --> 14:11:56,110
Eric Wulff: Um!
717
14:11:56,270 --> 14:12:02,699
Eric Wulff: Because you need to train all the trials up to the same epoch number before you decide which ones to keep and which not.
718
14:12:04,140 --> 14:12:11,450
Eric Wulff: So it doesn't have to do with Ray Tune per se. It just has to do with the particular search algorithm, or,
719
14:12:11,530 --> 14:12:15,219
Eric Wulff: well, a lot of search algorithms actually work like that.
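The scheme described above (train all trials to the same epoch, terminate the less promising ones, continue with the rest) is essentially successive halving, one of the scheduler families Ray Tune supports. A stdlib-only sketch of the idea, with `step` as an invented stand-in for one epoch of real training:

```python
import random

def step(trial):
    """Advance one trial by one 'epoch'. A deterministic stand-in for real
    training: loss improves with epochs; lower 'quality' means a better trial."""
    trial["epoch"] += 1
    rng = random.Random(trial["seed"] * 1000 + trial["epoch"])
    trial["loss"] = trial["quality"] / trial["epoch"] + 0.01 * rng.random()
    return trial

def successive_halving(trials, rungs=(2, 4, 8), keep=0.5):
    """Train every surviving trial up to each rung epoch, then terminate the
    worse half, exactly the pause-and-compare scheme described above."""
    for rung in rungs:
        for t in trials:
            while t["epoch"] < rung:        # everyone reaches the same epoch
                step(t)
        trials = sorted(trials, key=lambda t: t["loss"])
        trials = trials[: max(1, int(len(trials) * keep))]
    return trials[0]                        # the single surviving best trial

# Eight hypothetical trials; 8 -> 4 -> 2 -> 1 survive across the rungs.
trials = [{"seed": i, "epoch": 0, "quality": random.Random(i).uniform(0.5, 2.0)}
          for i in range(8)]
best = successive_halving(trials)
```

This also makes the node-count discussion concrete: with fewer nodes than live trials, some trials must be checkpointed and paused at a rung boundary while others catch up, which is the "excessive reloading" Eric attributes the super-linear scaling to.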
720
14:12:18,070 --> 14:12:19,019
Yeah,
721
14:12:19,250 --> 14:12:21,929
Enrico Fermi Institute: You have a question or comment in the chat.
722
14:12:22,100 --> 14:12:40,870
Ian Fisk: Yeah, I had a question for Eric which was, and maybe it's too early to tell. But my question was, how stable you expected the hyper parameter tuning to be in the sense that are we expecting that every time we change network or get new data, we're going to have to re-optimize the hyper parameters. Or is this something that
723
14:12:40,880 --> 14:12:50,119
Ian Fisk: um, that once we sort of optimize for a particular problem, we may find that those are stable over periods of time. The reason I ask is that this seems like a —
724
14:12:50,620 --> 14:12:59,900
Ian Fisk: when we talk about the use of HPC or clouds and specialized resources, training is a big part of how we tend to use them. But the hyperparameter
725
14:13:00,190 --> 14:13:11,330
Ian Fisk: optimization sort of increases that by a factor of fifty or so. And so, if we have to do it each time, we probably need to factor those things into our thoughts about where we're constrained on resources.
726
14:13:12,110 --> 14:13:14,099
Eric Wulff: Yeah, so
727
14:13:14,770 --> 14:13:16,039
Eric Wulff: um,
728
14:13:16,760 --> 14:13:23,389
Eric Wulff: it would completely depend on how much you change your model, or how much you change the problem.
729
14:13:23,470 --> 14:13:24,989
Eric Wulff: I mean, if you're
730
14:13:25,010 --> 14:13:27,139
Eric Wulff: if you change your model
731
14:13:27,180 --> 14:13:32,739
Eric Wulff: architecture, you will probably have to run a new hyperparameter optimization,
732
14:13:32,770 --> 14:13:38,310
Eric Wulff: um, because you might not even have the same hyperparameters in your model anymore.
733
14:13:38,550 --> 14:13:40,150
Eric Wulff: Uh,
734
14:13:40,610 --> 14:13:56,560
Eric Wulff: But, you know, if things aren't too different, you might not have to hypertune, or maybe just do a small hypertuning, you know, just a few parameters in some narrow or small search space.
735
14:13:56,690 --> 14:13:58,640
Eric Wulff: So, for instance,
736
14:13:59,020 --> 14:14:00,809
Eric Wulff: if you look at other
737
14:14:00,840 --> 14:14:01,950
Eric Wulff: uh
738
14:14:02,920 --> 14:14:06,280
Eric Wulff: ah, other fields, such as, for instance,
739
14:14:06,390 --> 14:14:09,070
Eric Wulff: image recognition, or object detection.
740
14:14:09,210 --> 14:14:26,879
Eric Wulff: Um, if you find a network that performs well on, you know, classifying certain kinds of objects, then it's very likely that, using the same hyperparameters, it would be good at classifying other kinds of objects as well, if you just have labeled data for those objects.
741
14:14:26,890 --> 14:14:29,329
So in that case, you probably wouldn't have to
742
14:14:31,100 --> 14:14:33,510
Eric Wulff: run a full hyperparameter optimization again.
743
14:14:37,260 --> 14:14:46,599
Ian Fisk: Thanks. It's very impressive, the amount that it improves the situation. Getting a factor of two is nice.
744
14:14:48,460 --> 14:14:49,360
Eric Wulff: Thanks.
745
14:14:50,810 --> 14:15:07,050
Paolo Calafiura (he): A question or comment from Paolo? Yes. Apologies if the question was already addressed; I missed the first couple of minutes. So my question is: here you're starting to show the scaling at four nodes,
746
14:15:07,060 --> 14:15:13,339
Paolo Calafiura (he): and I wonder what the scaling would look like if you compared it with a single node, or a single GPU.
747
14:15:14,870 --> 14:15:16,540
Eric Wulff: Um.
748
14:15:26,890 --> 14:15:32,669
Eric Wulff: The fewer nodes you have, the more of this excessive reloading has to happen.
749
14:15:32,930 --> 14:15:37,320
Eric Wulff: So just using one node would be very, very slow.
750
14:15:37,510 --> 14:15:50,440
Paolo Calafiura (he): But is that because of the way Ray Tune does this business? It's because of the search algorithm we use? So it's not Ray Tune per se, it's the
751
14:15:51,360 --> 14:15:58,859
Eric Wulff: It's because of the algorithm. You wouldn't be able to run this faster with another framework. Well, I mean,
752
14:15:59,760 --> 14:16:18,139
Paolo Calafiura (he): it's the algorithm's problem, not Ray Tune's. So it's a little bit harder then to do the comparison. I mean, I'm thinking, if you used scikit-learn, like, to optimize on a single GPU, to do the same thing. And then, of course, there is the question, what is the
753
14:16:22,910 --> 14:16:26,699
Paolo Calafiura (he): Okay, it's It's a complicated question.
754
14:16:29,870 --> 14:16:32,029
Okay? Next we have
755
14:16:34,400 --> 14:16:45,700
Shigeki: Uh, yeah, I'm gonna show my ignorance here, um, just trying to understand exactly how this works. I think I'm on the first slide. Second slide.
756
14:16:45,730 --> 14:16:54,140
Shigeki: You show trial one, trial two, trial three, and those trials are independent of each other, right? They're all working on.
757
14:16:54,440 --> 14:17:12,849
Shigeki: Okay. Uh, the next thing here is that presumably they're reading the same set of data over, in order to train, but they're completely independent in terms of where they are in the input stream, right? They're not working in lockstep or anything.
758
14:17:13,630 --> 14:17:25,690
Eric Wulff: It depends on the kind of search algorithm that you use, the hyperparameter search algorithm. So, um,
759
14:17:26,590 --> 14:17:27,650
Eric Wulff: in um.
760
14:17:28,350 --> 14:17:40,270
Eric Wulff: well, to begin with, you can choose not to use any particular search algorithm, and then everything is just done in parallel, sort of. Um,
761
14:17:40,560 --> 14:17:41,710
Eric Wulff: however,
762
14:17:42,000 --> 14:17:53,250
Eric Wulff: but it's much more efficient to use some kind of search algorithm. So then, um, you would want to train all the trials up to a certain
763
14:17:53,570 --> 14:17:58,200
Eric Wulff: epoch number. Let's say you train them all up to epoch five, and then
764
14:17:58,530 --> 14:18:08,800
Eric Wulff: uh, you have some algorithm that decides which ones to terminate and which ones to continue training, and in place of the ones you terminated, you start new trials
765
14:18:08,820 --> 14:18:12,450
Eric Wulff: with the with new hyper parameter configurations.
766
14:18:12,500 --> 14:18:19,529
Eric Wulff: Um. So then, if you have many more trials than you have compute nodes, you have to
767
14:18:19,720 --> 14:18:27,839
Eric Wulff: pause some trials at epoch five, and then load in new trials and train them up until epoch five.
768
14:18:28,230 --> 14:18:30,749
Shigeki: Okay. So
769
14:18:31,070 --> 14:18:35,280
Shigeki: Okay. But to a certain extent, though, the trials are running independently,
770
14:18:35,290 --> 14:18:51,889
Shigeki: and they get synchronized at some point by the epoch that you're stopping at. But other than that, up to that epoch point, they're blasting through the data as quickly as they can. And so they're not in sync. Okay,
771
14:18:52,640 --> 14:18:53,690
Shigeki: thank you.
772
14:18:56,430 --> 14:18:59,330
Enrico Fermi Institute: So how long does it take to run this on,
773
14:18:59,370 --> 14:19:07,800
Enrico Fermi Institute: you know, for one node? You know, how long is it running the hyperparameter optimization, in terms of wall time? Hours?
774
14:19:08,120 --> 14:19:09,599
Eric Wulff: Um!
775
14:19:10,010 --> 14:19:11,059
Eric Wulff: So
776
14:19:11,130 --> 14:19:21,010
Eric Wulff: that can vary a lot, depending on how large your search space is, and the model we use, and the data that we use, and so on. I think for the results I show here,
777
14:19:21,310 --> 14:19:22,860
Eric Wulff: Um
778
14:19:23,820 --> 14:19:26,859
Eric Wulff: uh, If I remember correctly,
779
14:19:27,120 --> 14:19:33,029
Eric Wulff: the whole thing took around eighty hours
780
14:19:33,190 --> 14:19:35,740
Eric Wulff: in wall time,
781
14:19:35,980 --> 14:19:40,909
Eric Wulff: and that was using, uh, twelve
782
14:19:40,930 --> 14:19:45,800
Eric Wulff: compute nodes with four GPUs each.
783
14:19:45,810 --> 14:20:11,110
Enrico Fermi Institute: And that can be, you know, trivially broken up into multiple jobs and things like that? The reason I ask is, one of the things I notice is that on, you know, some of the HPCs, at least in the US, right, they have maximum wall times for your jobs in the queues, right? So I'm looking at Perlmutter right now, and it says you can have a GPU job in the regular queue for twelve hours at most.
784
14:20:11,120 --> 14:20:15,659
Enrico Fermi Institute: And so I'm wondering, like, what useful work can we get done for,
785
14:20:15,870 --> 14:20:25,280
Enrico Fermi Institute: you know, hyperparameter optimization or machine learning in general, given the relatively short maximum wall time?
786
14:20:25,450 --> 14:20:29,280
Eric Wulff: Um, so one solution is to, uh,
787
14:20:29,460 --> 14:20:31,290
Eric Wulff: checkpoint
788
14:20:31,950 --> 14:20:39,149
Eric Wulff: the search, and then just launch it again and continue where you left off. We're able to do that, so
789
14:20:39,190 --> 14:20:44,300
Eric Wulff: we are saving checkpoints regularly throughout the workload.
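The checkpoint-and-resubmit pattern described here (save state once per epoch, then a new job picks up where the wall-time-limited one stopped) might look roughly like this. A sketch under stated assumptions: the JSON file format, the function names, and the loss update are all invented for illustration.

```python
import json
import os
import tempfile

def train(total_epochs, max_epochs_this_job, ckpt_path):
    """Train for at most max_epochs_this_job epochs (the 'wall-time limit'),
    resuming from the checkpoint file if a previous job left one behind."""
    state = {"epoch": 0, "loss": 1.0}
    if os.path.exists(ckpt_path):              # a previous job was cut short
        with open(ckpt_path) as f:
            state = json.load(f)
    for _ in range(max_epochs_this_job):
        if state["epoch"] >= total_epochs:
            break
        state["epoch"] += 1
        state["loss"] *= 0.9                   # stand-in for one epoch of training
        with open(ckpt_path, "w") as f:        # checkpoint once per epoch
            json.dump(state, f)
    return state

ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
first = train(100, 60, ckpt)    # job 1 hits the 60-epoch cap mid-run
second = train(100, 60, ckpt)   # job 2 resumes at epoch 60, finishes at 100
```

With per-epoch checkpoints, a twelve-hour queue limit like Perlmutter's costs at most one epoch of repeated work per resubmission, which is why Eric calls the short limits inconvenient rather than deal-breaking.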
790
14:20:45,570 --> 14:20:47,679
Eric Wulff: Okay? And uh, yeah,
791
14:20:47,820 --> 14:20:50,360
Enrico Fermi Institute: how often do you save the checkpoints?
792
14:20:51,280 --> 14:21:07,169
Eric Wulff: Um, that's configurable, but usually once per epoch. So once per read-through of the dataset.
793
14:21:08,020 --> 14:21:15,920
Eric Wulff: Uh, that depends a lot also. But, um, let's say around, well, between twelve and twenty-four hours.
794
14:21:17,110 --> 14:21:20,540
Eric Wulff: But this completely depends on how much data you have. And uh,
795
14:21:21,140 --> 14:21:24,060
Eric Wulff: you know, the particular model you use.
796
14:21:24,530 --> 14:21:41,880
Enrico Fermi Institute: That's an epoch for the hyperparameter optimization itself, not just a single instance of the neural network?
797
14:21:42,740 --> 14:21:45,710
twenty-four hours for a single,
798
14:21:46,740 --> 14:21:53,449
Eric Wulff: And that's because, you know, we have quite a big dataset. So that's
799
14:21:53,510 --> 14:22:00,430
Eric Wulff: why. But we're also using four nodes with A100 GPUs for that. So
800
14:22:00,820 --> 14:22:02,320
Eric Wulff: if you have
801
14:22:02,640 --> 14:22:05,420
Eric Wulff: older GPUs, that would take much longer.
802
14:22:08,980 --> 14:22:19,460
Enrico Fermi Institute: I guess what I'm wondering is, you know, for the report, should we have some recommendation about the policies at these sites, you know, to
803
14:22:20,140 --> 14:22:25,540
Enrico Fermi Institute: allow much longer GPU jobs to run, to do these sorts of tasks?
804
14:22:26,090 --> 14:22:29,069
Eric Wulff: Well, my opinion is that it would be
805
14:22:29,720 --> 14:22:47,669
Enrico Fermi Institute: it would be convenient if we could. But you know, it's not deal-breaking, because we can checkpoint this and just reload, right. But wait, you just said your epochs are twelve to twenty-four hours, and Lincoln just said that
806
14:22:47,720 --> 14:22:57,990
Eric Wulff: twelve hours is the limit. Sorry, sorry. So, uh, yeah, I misspoke here, so,
807
14:22:58,500 --> 14:23:13,459
Eric Wulff: uh, apologies, it's a bit late over here. It takes twenty-four hours for a full training, not for one epoch.
808
14:23:13,470 --> 14:23:33,439
Enrico Fermi Institute: We're not asking for a policy change, right? Just a behavioral change, with checkpointing. And you're saving at the end of each full training, or each epoch? Yeah, sorry. So you have, like, two hundred epochs, is that right? You probably have it in the plot.
809
14:23:33,650 --> 14:23:37,789
Eric Wulff: Uh, yeah, in the plot here. So, um,
810
14:23:38,030 --> 14:23:56,069
Eric Wulff: yeah, so this is a plot from last year. Now we have a larger dataset, and we train for about a hundred epochs, and that takes roughly twenty-four hours.
811
14:23:57,900 --> 14:23:59,820
Enrico Fermi Institute: Okay, Um,
812
14:24:00,170 --> 14:24:13,310
Enrico Fermi Institute: yeah. Would adding more GPUs per node help you in terms of the number of epochs? Or do you have enough data to get reasonable convergence, at least with this model, after one hundred?
813
14:24:21,110 --> 14:24:22,430
Eric Wulff: actually we are.
814
14:24:22,690 --> 14:24:27,659
Eric Wulff: We just saw that if we scale up our model
815
14:24:27,690 --> 14:24:40,729
Eric Wulff: significantly, so make the model larger, with many more parameters, we can easily improve the physics performance. Um, so we just tried that
816
14:24:41,300 --> 14:24:44,330
Eric Wulff: this week,
817
14:24:44,660 --> 14:24:47,859
Eric Wulff: because we were curious, basically. Uh, however,
818
14:24:47,920 --> 14:24:49,790
Eric Wulff: that's sort of not a model that would run
819
14:24:58,390 --> 14:25:02,050
Eric Wulff: quickly enough in production, anyway.
820
14:25:02,590 --> 14:25:03,639
Eric Wulff: Um,
821
14:25:06,150 --> 14:25:08,350
Eric Wulff: but it sort of shows that
822
14:25:08,440 --> 14:25:17,159
Eric Wulff: there is enough information in the data to do better. We just need to improve the model, or the training of the model, somehow.
823
14:25:20,160 --> 14:25:25,100
Enrico Fermi Institute: Okay. Um, Shigeki, you have your hand raised.
824
14:25:25,830 --> 14:25:42,530
Shigeki: Uh, yeah, I just have a question in terms of the amount of data you're going through, and the model size. I guess that's measured in terms of the number of parameters as well as hyperparameters. And whether or not there is a
825
14:25:42,540 --> 14:25:54,120
Shigeki: size that physics problems in HEP tend to gravitate to, or can it be all over the map in terms of model size, dataset size, and number of hyperparameters?
826
14:25:55,040 --> 14:25:56,179
Eric Wulff: Um!
827
14:25:56,320 --> 14:26:00,129
Eric Wulff: So, the number of hyperparameters, um,
828
14:26:00,190 --> 14:26:07,620
Eric Wulff: that's a little bit arbitrary, depending on what you mean by hyperparameters. So if you
829
14:26:08,040 --> 14:26:10,180
Eric Wulff: uh if you count
830
14:26:10,250 --> 14:26:11,389
Eric Wulff: well,
831
14:26:11,430 --> 14:26:13,889
Eric Wulff: you can configure
832
14:26:14,040 --> 14:26:23,330
Eric Wulff: very many things with our model. So if you count all those hyperparameters, I don't know how many they are, but there are hundreds, and we don't tune all of them, because they're too many.
833
14:26:28,100 --> 14:26:33,720
Eric Wulff: Uh, the the number of trainable parameters in the model is around one million,
834
14:26:34,130 --> 14:26:37,850
Eric Wulff: so that's fairly small if you
835
14:26:37,890 --> 14:26:39,450
Eric Wulff: compare with other
836
14:26:40,090 --> 14:26:46,880
Eric Wulff: fields, like image recognition or natural language processing; then this is a really small model.
837
14:26:47,030 --> 14:26:48,389
Eric Wulff: Um!
838
14:26:48,570 --> 14:26:50,480
Eric Wulff: However, we think that,
839
14:26:50,580 --> 14:26:52,679
Eric Wulff: well, I actually don't know
840
14:26:53,190 --> 14:26:57,809
Eric Wulff: the memory requirements that we would have to, uh,
841
14:26:57,850 --> 14:27:05,289
Eric Wulff: adhere to, if this would go into production at some point in the future. But I don't think we could go much larger,
842
14:27:05,410 --> 14:27:19,759
Eric Wulff: uh, at least not without doing some kind of quantization, quantization-aware training or post-training quantization, or perhaps pruning weights after training, or doing some other tricks like that.
843
14:27:19,990 --> 14:27:23,109
Eric Wulff: uh data set size. So the
844
14:27:23,680 --> 14:27:26,389
Eric Wulff: the one we are currently using.
845
14:27:30,540 --> 14:27:34,559
Eric Wulff: I think it's around four hundred thousand events
846
14:27:35,000 --> 14:27:38,260
Eric Wulff: collision events of the different kinds.
847
14:27:40,140 --> 14:27:44,790
Shigeki: Do you have an approximate idea of how much actual gigabytes that is?
848
14:27:45,140 --> 14:27:46,559
Eric Wulff: Um
849
14:27:47,210 --> 14:27:48,730
Shigeki: auto-
850
14:27:49,250 --> 14:27:51,920
Eric Wulff: It's a few hundred gigabytes,
851
14:27:52,100 --> 14:27:54,480
Eric Wulff: less than a thousand.
852
14:27:55,530 --> 14:28:08,920
Shigeki: And presumably, when you're running this, it's compute-bound, not I/O-bound, in terms of feeding the training data,
853
14:28:08,950 --> 14:28:11,229
Shigeki: or it depends.
854
14:28:11,450 --> 14:28:18,439
Eric Wulff: No, I would say it's compute-bound. Oh, you mean, looking at the GPU utilization? It goes to
855
14:28:18,590 --> 14:28:20,070
Eric Wulff: close to one hundred.
856
14:28:20,139 --> 14:28:22,229
Shigeki: Mhm Okay, thanks.
857
14:28:22,559 --> 14:28:27,009
Enrico Fermi Institute: And do you know how much of the memory on the GPU you're using?
858
14:28:27,570 --> 14:28:30,279
Eric Wulff: Uh, yes, we uh we,
859
14:28:30,400 --> 14:28:33,209
Eric Wulff: use all of it, basically.
860
14:28:34,049 --> 14:28:40,529
Enrico Fermi Institute: So then it would not help you to have centers that chop up these big GPUs.
861
14:28:41,969 --> 14:28:45,769
Eric Wulff: I don't think so. Um, so there is a problem,
862
14:28:45,930 --> 14:28:57,160
Eric Wulff: um, with having too large batch sizes sometimes. Um, basically, in order to fill up the GPU, you increase the batch size that you use for training,
863
14:28:57,230 --> 14:28:58,449
Eric Wulff: Um,
864
14:28:59,530 --> 14:29:05,829
Eric Wulff: and that means you can push more data, though,
865
14:29:05,850 --> 14:29:14,719
Eric Wulff: through per time unit, but, you know, it doesn't necessarily mean you can do more optimization steps. So you might not
866
14:29:14,879 --> 14:29:17,020
Eric Wulff: uh reach
867
14:29:17,160 --> 14:29:20,090
Eric Wulff: the same accuracy quicker.
868
14:29:26,629 --> 14:29:38,190
Eric Wulff: It's not obvious that it's always the case that you can just throw more memory at it and it helps. Yeah, I was actually thinking of it the other way:
869
14:29:38,990 --> 14:29:45,470
Enrico Fermi Institute: we have a question in our data center of how much we should chop up the A100s using MIG.
870
14:29:47,480 --> 14:29:50,440
Enrico Fermi Institute: You know, do we give a person a whole
871
14:29:51,010 --> 14:29:54,830
Enrico Fermi Institute: eighty gigs, or split it up two ways or four ways,
872
14:29:55,139 --> 14:30:03,550
Eric Wulff: uh, to several users at the same time.
873
14:30:05,549 --> 14:30:06,580
Enrico Fermi Institute: Thanks.
874
14:30:07,530 --> 14:30:09,519
Enrico Fermi Institute: Show another comment:
875
14:30:12,860 --> 14:30:17,950
Enrico Fermi Institute: Sorry I got to the
876
14:30:18,650 --> 14:30:27,329
Dirk: Yeah, I had a question, and it's not so much, I mean, Eric, if you know, you can answer, but it's more looking at the
877
14:30:27,559 --> 14:30:38,899
Dirk: broader impact of that, and the follow-on, because this is interesting, and this is ongoing R&D. But what's the next step? Have there been any discussions how
878
14:30:38,969 --> 14:30:41,610
Dirk: to integrate this in like?
879
14:30:41,700 --> 14:30:58,269
Dirk: Eventually? You said it works, it's improving particle flow. So eventually it should feed back into how we run the reconstruction, basically. And then the question comes: how would you actually deploy this? How often do you have to run it?
880
14:30:58,540 --> 14:31:19,770
Dirk: How long does it take? And how often do I have to renew it, basically, with new data, to check that the parameters are still okay? And it's not just a question about this specific thing. So these are the larger questions, maybe Lindsay or Mike can answer, if there have been any
881
14:31:19,780 --> 14:31:26,789
Dirk: discussions of that already, or if that's still to come after the initial R&D is done.
882
14:31:30,130 --> 14:31:33,150
Eric Wulff: Well, I would say, if uh,
883
14:31:33,470 --> 14:31:36,980
Eric Wulff: if we are able to prove, or
884
14:31:37,030 --> 14:31:38,920
Eric Wulff: somehow show, that
885
14:31:39,020 --> 14:31:43,090
Eric Wulff: this machine learned approach to particle flow works
886
14:31:43,170 --> 14:31:44,490
Eric Wulff: uh
887
14:31:44,880 --> 14:31:52,579
Eric Wulff: as well, but more efficiently, or or even uh better than the uh
888
14:31:52,610 --> 14:31:54,660
Eric Wulff: methods that are used at the moment.
889
14:31:55,670 --> 14:31:59,449
Eric Wulff: Um, then we sort of freeze that model and
890
14:31:59,690 --> 14:32:04,779
Eric Wulff: get it into production, and then we shouldn't need to redo any hyper-
891
14:32:04,820 --> 14:32:34,339
Dirk: parameter optimization or anything like that. Then, you know, it's like having a finished algorithm. Yeah, but during data taking the detector changes all the time. So who knows if the training you did on 2022 data, or even Run 2 data, is still valid for your next set of data. Right. So we're not training on data, we're training on simulation. Okay, right. But I think when we talk about these kinds of problems, one of the things that needs to be studied
892
14:32:34,580 --> 14:32:44,590
Ian Fisk: is how stable these are, because it could be that we're incredibly lucky, and once you do the hyperparameter optimization it's applicable to
893
14:32:45,180 --> 14:32:51,009
Ian Fisk: small changes in data. And one thing that I think we can see from Eric's plots is that it
894
14:32:51,050 --> 14:33:01,189
Ian Fisk: makes these things faster. They train faster and better after they've been optimized. And so if we were unreasonably lucky, they'll actually save us resources.
895
14:33:02,360 --> 14:33:03,300
Okay,
896
14:33:03,500 --> 14:33:08,860
Dirk: Okay. But it sounds like it's a discussion that's still to come. We're not quite there yet.
897
14:33:09,400 --> 14:33:25,109
Ian Fisk: Well, I think so. I think, given how much this improves the situation, chances are, and I think this applies to multiple science fields, not just ourselves, that we should be factoring these things into our discussion about how we're going to use HPC
898
14:33:25,140 --> 14:33:35,829
Ian Fisk: for the report. And then we'll have to wait and see as to whether this is a workflow that we're constantly running, or one that we run once in a while.
899
14:33:39,190 --> 14:33:47,179
Mike Hildreth: Yeah, I guess I would agree with that. We don't have enough data
900
14:33:47,840 --> 14:33:53,670
Mike Hildreth: on how often we're going to have to train these. But this use case is certainly in the planning.
901
14:33:54,080 --> 14:33:55,760
Enrico Fermi Institute: Is it right?
902
14:33:55,850 --> 14:34:07,809
Enrico Fermi Institute: I think the one remaining worry is, we haven't been through, like, a complete recalibration cycle of the detector after a stop or anything like that, to see if
903
14:34:07,820 --> 14:34:21,400
Enrico Fermi Institute: to see how robust a single training, or the most optimal training, is with respect to the changing parameters of the detector. It's just something we have to find out. But it's not going to change the picture all that much, to be honest.
904
14:34:21,410 --> 14:34:28,360
Enrico Fermi Institute: But yeah, I agree with Ian here. This is probably going to save us resources as well in the long run.
905
14:34:28,620 --> 14:34:30,320
Dirk: Okay, thanks.
906
14:34:30,510 --> 14:34:38,550
Dirk: That makes it difficult for us to write because we can write the use case in, but it's extremely hard to attach any numbers to it at the moment.
907
14:34:41,470 --> 14:34:55,099
Enrico Fermi Institute: Yeah, I guess another way to summarize it: we've shown that this works, and that we can get really great results out of it, but we haven't understood the true, you know, steady-state operational parameters of this system.
908
14:34:59,230 --> 14:35:04,370
Eric Wulff: And just to be clear, there still needs to be
909
14:35:04,610 --> 14:35:08,699
Eric Wulff: quite a bit of work before this would be ready to go into production.
910
14:35:09,140 --> 14:35:10,600
Eric Wulff: It's still
911
14:35:10,880 --> 14:35:14,050
Eric Wulff: like, we don't understand
912
14:35:14,200 --> 14:35:18,509
Eric Wulff: all the properties of how it reconstructs particles well enough. Yet,
913
14:35:20,650 --> 14:35:23,980
Eric Wulff: although, you know, it's looking good, it's looking promising,
914
14:35:24,230 --> 14:35:30,350
Eric Wulff: but it needs to be validated much more before production.
915
14:35:41,060 --> 14:35:44,129
Enrico Fermi Institute: So, do we have more questions?
916
14:35:46,660 --> 14:35:50,649
Enrico Fermi Institute: I guess one silly question
917
14:35:51,140 --> 14:36:03,900
Enrico Fermi Institute: in terms of actually trying to use this, like, in CMSSW, and this is mostly because I don't remember from the last time that Joseph presented this: how fast does this go per event in inference mode?
918
14:36:04,220 --> 14:36:06,810
Enrico Fermi Institute: What does the throughput look like?
919
14:36:06,940 --> 14:36:24,380
Eric Wulff: Um, I don't think we have done anything there that would be comparable to production. Or maybe an even better question is: what does the memory footprint look like on GPU or CPU?
920
14:36:24,770 --> 14:36:31,000
Eric Wulff: Uh, I don't know that off the top of my head, but I know we have a plot somewhere that I can
921
14:36:31,100 --> 14:36:32,899
Enrico Fermi Institute: all good. Thank you.
922
14:36:37,540 --> 14:36:46,069
Enrico Fermi Institute: Okay, if there are no other questions, we can move on. Thank you very much for the presentation, Eric.
923
14:36:46,360 --> 14:36:48,119
Eric Wulff: No problem. Thanks for listening.
924
14:36:51,760 --> 14:37:05,870
Enrico Fermi Institute: Okay. Having some minor networking challenges here with the pay attention.
925
14:37:06,270 --> 14:37:12,320
Enrico Fermi Institute: I think that I think the room is still okay with sharing It's just in everybody's laptop. It's connected as your own as a
926
14:37:12,570 --> 14:37:27,260
Enrico Fermi Institute: Yeah, the wired connection is fine. Dirk, you wanted to talk about some more machine-learning-related topics while we were on this. Did you have other particular things to bring up, and then we bounce back to the networking and stuff like that?
927
14:37:27,780 --> 14:37:43,740
Dirk: I think we can follow the regular agenda. We put the machine learning early because of some time constraints, but maybe we do the rest of the R&D as part of the normal schedule.
928
14:37:43,780 --> 14:37:45,420
Enrico Fermi Institute: Okay.
929
14:37:45,550 --> 14:37:49,740
Dirk: So, figuring out where we were: we were at impact, right?
930
14:37:49,770 --> 14:37:50,810
Enrico Fermi Institute: Yeah,
931
14:37:50,880 --> 14:37:56,380
Dirk: I can say a little bit about that, and I think we discussed some of that yesterday already,
932
14:37:56,630 --> 14:38:01,189
Dirk: but it's also including cloud now. So we're looking at both.
933
14:38:01,450 --> 14:38:19,450
Dirk: So what happens if we actually start using a lot of HPC and cloud, and how the integration works. I mean, at the moment we run them opportunistically, so they are considered an add-on. But if we ever get to a point where they're a large fraction of our overall resources,
934
14:38:19,660 --> 14:38:37,420
Dirk: what's the impact on our global computing infrastructure? And how does it impact the owned resources that are still in the mix? So basically you would have a lot of compute external to our own resources in some way.
935
14:38:37,470 --> 14:38:43,410
Dirk: And then you look at: what does that mean for our own sites? What kind of changes
936
14:38:44,440 --> 14:39:02,069
Dirk: might potentially be needed there to facilitate large-scale cloud, and to a large degree that will depend on how much we are actually using the storage at the cloud or the HPC. So if you consider that you don't have any storage, and you have to stream, or some other way get the data
937
14:39:02,110 --> 14:39:09,129
Dirk: in and out quickly and just process it on demand, that puts more pressure on our own sites, versus
938
14:39:09,150 --> 14:39:22,100
Dirk: if you look at ATLAS: they have a self-contained site. That follows more the model of just bringing up another site somewhere else on some external resources, but it's kind of mostly self-contained.
939
14:39:22,530 --> 14:39:23,539
Dirk: Um!
940
14:39:23,700 --> 14:39:42,489
Dirk: The other impact is that if we decide tomorrow, for instance, that our code performs great on ARM, and we should switch to it as much as possible because it's more cost-effective, you can actually do that much quicker on the cloud, like, for instance, on that Google site. In principle,
941
14:39:43,130 --> 14:39:59,530
Dirk: ATLAS could decide tomorrow that, from now on, we're providing ARM CPU and not Intel CPU anymore, because you just change the instance type. You can't do that on our own resources; that's a much longer process of multiple years to swap out resources.
942
14:39:59,560 --> 14:40:01,730
Dirk: And uh, yeah,
943
14:40:01,830 --> 14:40:21,730
Dirk: And the other obvious issue is, even if we get storage at the cloud and HPC sites, you have to worry about transfers, because all these resources need to be integrated in our transfer infrastructure. We need to have Rucio be able to connect somehow, or maybe
944
14:40:22,480 --> 14:40:23,449
Dirk: have
945
14:40:24,520 --> 14:40:35,929
Dirk: via intermediary node services. I know BNL has some Globus Online endpoint that ATLAS uses to facilitate transfers to some HPCs, and things like that, so
946
14:40:36,540 --> 14:40:52,250
Dirk: that feeds directly into the last point, network integration. So it's not just the transfer services, but also the underlying transfer fabric, the network connectivity of the cloud and HPC resources.
947
14:40:57,530 --> 14:41:07,960
Dirk: As I said, we discussed it yesterday, some of it already. And uh, the one comment was that we should break out hardware and service costs that are basically
948
14:41:08,930 --> 14:41:11,800
Dirk: so anything else, any other comments on this
949
14:41:17,770 --> 14:41:20,830
Enrico Fermi Institute: one of the things that we had talked about in our,
950
14:41:21,430 --> 14:41:33,769
Enrico Fermi Institute: you know, just discussions among the blueprint group. Uh, you know, before the the workshop here was Is there any impact on
951
14:41:34,040 --> 14:41:46,960
Enrico Fermi Institute: on grid sites, if we were to, you know, do something like shift large amounts of certain kinds of workflows to cloud? Like, if we did a lot of,
952
14:41:46,980 --> 14:41:53,320
Enrico Fermi Institute: you know, a lot more simulation on HPC, would we have to, you know,
953
14:41:53,540 --> 14:42:01,009
Enrico Fermi Institute: have the Tier-2s run correspondingly more analysis or something like that? If that were the case, would they have to
954
14:42:01,330 --> 14:42:04,189
Enrico Fermi Institute: up their facilities in certain ways,
955
14:42:05,200 --> 14:42:12,550
Enrico Fermi Institute: or does that not make sense at all? Should we just anticipate that we'll be able to run all workload types on all resources,
956
14:42:13,990 --> 14:42:15,060
things like that?
957
14:42:16,390 --> 14:42:19,609
Enrico Fermi Institute: I see There's a hand raised from Eric
958
14:42:34,640 --> 14:42:37,089
Eric Lancon: to export um
959
14:42:37,220 --> 14:42:39,349
Eric Lancon: the Cpu processing
960
14:43:17,160 --> 14:43:18,999
Eric Lancon: at the same site.
961
14:43:23,900 --> 14:43:40,379
Dirk: Yeah, that's something we worried about, the impact on the data transfers for Fermilab specifically. Because if you look at how we designed HEPCloud, we basically treat the HPC as an external compute resource, and then most of
962
14:43:40,540 --> 14:43:53,389
Dirk: the I/O and the data actually goes through Fermilab. So far everything is holding up nicely, but eventually, as we scale up HPC use, there's probably going to be an impact on
963
14:43:53,480 --> 14:43:58,259
Dirk: the provisioning of network and storage at Fermilab.
964
14:44:22,190 --> 14:44:23,250
Um.
965
14:44:23,340 --> 14:44:25,430
Enrico Fermi Institute: Other comments on
966
14:44:25,660 --> 14:44:30,349
Enrico Fermi Institute: the impact of HPC and cloud use on the existing infrastructure?
967
14:44:37,250 --> 14:44:39,300
Steven Timm: I'll just mention something I heard
968
14:44:39,420 --> 14:44:42,249
Steven Timm: you might not think about.
969
14:44:42,480 --> 14:44:43,560
Steven Timm: Uh.
970
14:44:43,730 --> 14:44:48,970
Steven Timm: This was not a CMS thing, but we were running a very
971
14:44:49,080 --> 14:44:58,119
Steven Timm: heavy load, running with the Google cloud for an inference server, and we managed to saturate the network link, for a short time, between us and Google.
972
14:44:59,330 --> 14:45:02,110
Steven Timm: So uh, you can.
973
14:45:02,280 --> 14:45:06,529
Steven Timm: If you're doing inference, you have to be careful of your um.
974
14:45:17,080 --> 14:45:29,849
Enrico Fermi Institute: I have what is possibly a profoundly uninformed question: how much of our Monte Carlo generation, at the actual generator level, is being
975
14:45:29,860 --> 14:45:38,420
Enrico Fermi Institute: done, or, well, what is taking place on GPUs? Like, using GPUs to do the Monte Carlo integration. And I'm waiting,
976
14:45:40,460 --> 14:45:56,779
Enrico Fermi Institute: because that is a significant fraction of time that we spend right now. I mean, is that, for ATLAS and CMS, zero? Because a very quick search on the Internet informs us that
977
14:45:56,790 --> 14:46:15,759
Enrico Fermi Institute: GPU Monte Carlo integration has been around for more than ten years now, and the speedup for that integration is like a factor of fifty or something. Though of course, this probably depends on the shape of the thing that you're integrating, and how many poles it has and whatnot.
978
14:46:15,870 --> 14:46:24,899
Enrico Fermi Institute: But has anyone looked at benchmarking that, And could it have a major impact if we could significantly reduce the
979
14:46:25,020 --> 14:46:28,389
Enrico Fermi Institute: the time to integrating
980
14:46:28,420 --> 14:46:37,380
Enrico Fermi Institute: time to getting an integrated cross-section, and then also the time to unweighting the necessary amounts of events.
981
14:46:37,460 --> 14:46:48,019
Enrico Fermi Institute: And could that fit on the HPC resources better? Could we use that in any way? I know this gets really open-ended, but it seems like it's something we're not considering,
982
14:46:48,150 --> 14:46:54,739
Enrico Fermi Institute: because it would be a really nice way to hide a lot of the latency in our production workloads right now,
983
14:46:55,120 --> 14:46:57,210
Enrico Fermi Institute: or get rid of it, not even just hide it.
984
14:47:00,660 --> 14:47:13,370
Enrico Fermi Institute: Yeah, this was a really open-ended question. But have we looked at that? And if we're not doing it now, after ten years, there must be something wrong,
985
14:47:13,420 --> 14:47:20,730
Dirk: Maybe, Lindsay, you and Mike should be in the best position to be able to answer that question, in terms of
986
14:47:21,190 --> 14:47:25,329
Enrico Fermi Institute: for something that is that old.
987
14:47:25,360 --> 14:47:31,520
Enrico Fermi Institute: There's either something wrong with it, or we've actually just not been paying attention to it for a decade.
988
14:47:31,530 --> 14:47:46,870
Enrico Fermi Institute: Yeah, I personally don't have any information on that. Mike, do you have anything? I think the answer is zero as well, you know. So why aren't we using this? That's kind of a weird one.
989
14:47:47,170 --> 14:47:49,719
Steven Timm: There have been studies recently that show
990
14:47:49,890 --> 14:48:01,270
Steven Timm: the dominant part of generation is actually throwing the dice and rolling random numbers. I don't know if that's true for CMS, but I know it's true for others. I mean, could you envision a situation where you're
991
14:48:01,280 --> 14:48:16,659
Enrico Fermi Institute: doing nothing but generating random numbers? Yeah, I mean, that's probably what a large portion of it is, that they're throwing lots of random numbers in parallel. They have very good RNGs for GPUs.
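As a toy illustration of the point being discussed: the integration step of a generator is embarrassingly parallel, since every phase-space sample is independent, which is exactly the pattern GPUs like. A minimal NumPy sketch on CPU; the same vectorized code on a GPU array library (CuPy, JAX) is where the often-quoted large speedups come from. The integrand and sample count here are illustrative.

```python
import numpy as np

def mc_integrate(f, a: float, b: float, n: int, seed: int = 0) -> float:
    """Plain Monte Carlo estimate of the integral of f over [a, b].

    Every sample is independent, so the whole batch is evaluated in one
    vectorized call -- the property that makes this GPU-friendly.
    """
    rng = np.random.default_rng(seed)
    x = rng.uniform(a, b, size=n)   # n independent phase-space points
    return (b - a) * f(x).mean()    # average times volume

# Example: the integral of x^2 on [0, 1] is exactly 1/3.
estimate = mc_integrate(lambda x: x * x, 0.0, 1.0, n=1_000_000)
print(estimate)  # close to 0.3333
```

A real generator adds importance sampling and unweighting on top of this, but the core loop has the same independent-samples structure.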
992
14:48:17,070 --> 14:48:35,200
Dirk: I think the question also goes a little bit out of scope, because we're not supposed to look into what's going on on the framework side and the software side. But, I mean, from the conversation I had with Muddy on where the effort is spent in terms of GPU,
993
14:48:35,210 --> 14:48:39,789
Dirk: I think the simple answer is: we looked at the full chain,
994
14:48:40,240 --> 14:48:51,369
Dirk: Gen, Sim, Digi, Reco, plus whatever miscellaneous comes after, and then they decided that generation is not the primary target of
995
14:48:51,730 --> 14:49:02,819
Dirk: the porting effort, because it's overall not that important for us. It's less important than reconstruction and tracking; I mean, you just go for the lowest-hanging fruit,
996
14:49:03,210 --> 14:49:16,139
Dirk: and the picture changes, of course, from generator to generator. But I think that's the simple answer: the effort focused on certain areas, and that's one of the ones that wasn't focused on.
997
14:49:16,150 --> 14:49:34,669
Enrico Fermi Institute: Yeah, I can see that; that's a reasonable answer, I guess. Looking at kind of the shape of the compute facilities that we are getting from HPC, packaging up some huge job that you, you know, send out to an HPC, wait a long time, and get your answer back, it seems,
998
14:49:34,680 --> 14:49:46,849
Enrico Fermi Institute: at least in terms of, like, the geometry or the topology of the problem, that makes a lot more sense for the kind of resources we're talking about. But I understand that reco is certainly a higher priority in terms of compute.
999
14:49:49,940 --> 14:49:53,110
Enrico Fermi Institute: that's that's sort of where my thinking is heading, that's all,
1000
14:49:57,580 --> 14:49:59,379
Enrico Fermi Institute: Steve. Did you have another comment?
1001
14:50:00,440 --> 14:50:01,420
Steven Timm: No
1002
14:50:09,860 --> 14:50:13,999
Enrico Fermi Institute: Other comments here? Or should we move on to network integration?
1003
14:50:21,620 --> 14:50:24,489
Enrico Fermi Institute: Okay, Sounds like we should we should move on
1004
14:50:24,870 --> 14:50:25,860
um
1005
14:50:26,190 --> 14:50:27,260
Enrico Fermi Institute: screen.
1006
14:50:30,210 --> 14:50:51,389
Enrico Fermi Institute: So yeah, one of the things we wanted to talk about was just how our Tier-1s and Tier-2s are connected today, and sort of what the plans are for that in the future, some of the forward-looking stuff. And then we'll also have a presentation from Dale Carder of ESnet
1007
14:50:51,400 --> 14:50:59,409
Enrico Fermi Institute: um to give us some of his thoughts as well. Yeah, One of the questions that comes up here is
1008
14:51:00,220 --> 14:51:08,870
Enrico Fermi Institute: especially with the clouds: what can we do about connecting things to LHCONE? And in all this business,
1009
14:51:08,990 --> 14:51:14,459
Enrico Fermi Institute: people like to talk about egress costs. Is there any quick and easy thing we can do to reduce those?
1010
14:51:14,770 --> 14:51:15,980
Enrico Fermi Institute: um.
1011
14:51:16,900 --> 14:51:19,919
Enrico Fermi Institute: So for site, connectivity? Um
1012
14:51:19,990 --> 14:51:32,529
Enrico Fermi Institute: For CMS, one hundred gigabit at all the Tier-2 sites, and more to Fermilab. On the evolution of US-based site connectivity, there's plans to demonstrate
1013
14:51:32,690 --> 14:51:46,870
Enrico Fermi Institute: over one hundred gigabit transfers in 2023, tentative plans to have Tier-2s at four hundred gigabits in 2025. Fermilab has plans for upgrades, but they're taking sort of a year-by-year approach. I don't know, Dirk, if you want to add anything else to that.
1014
14:51:48,140 --> 14:51:59,569
Dirk: No, that's basically it. I mean, all these plans are kind of tentative. We know we have to upgrade to get to HL-LHC, and it's going to be a process, but the exact schedule is a bit
1015
14:51:59,660 --> 14:52:02,299
Dirk: and and undefined at the moment,
1016
14:52:02,330 --> 14:52:05,130
Enrico Fermi Institute: and I should say that a lot of these
1017
14:52:05,250 --> 14:52:07,310
Enrico Fermi Institute: plans were
1018
14:52:07,500 --> 14:52:09,719
Enrico Fermi Institute: not said so, but there are.
1019
14:52:10,130 --> 14:52:29,889
Enrico Fermi Institute: The plans were developed before the slip of the LHC schedule. So, you know, we're already talking about maybe pushing the demonstration of greater-than-one-hundred-gigabit transfers to 2024. So now that we have a couple more years, we're probably going to shift things back a bit.
1020
14:52:32,770 --> 14:52:45,550
Enrico Fermi Institute: On the ATLAS side: it says a few, but really most of the Tier-2s are basically at or near one hundred gigabits, some somewhat more than
1021
14:52:45,560 --> 14:52:59,419
Enrico Fermi Institute: a hundred, some at two-by-one-hundred, things like that. The Tier-1, as I understand, has at least four-by-one-hundred gigabit. If I'm misrepresenting any of the sites, just jump in and correct me. And yeah,
1022
14:52:59,430 --> 14:53:15,340
Enrico Fermi Institute: our expectation is that, you know, in the future we'll have multiple hundred gigabits of connectivity. One or more sites may have four-hundred-gigabit links; I think a lot of it depends on the economics of when it's sensible to start buying four hundred.
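To put those link speeds in perspective, a quick back-of-the-envelope transfer-time calculation; the dataset size and the line-rate efficiency here are illustrative assumptions, not figures from the discussion.

```python
def transfer_hours(dataset_tb: float, link_gbps: float,
                   efficiency: float = 1.0) -> float:
    """Hours to move `dataset_tb` terabytes over a `link_gbps` gigabit/s
    link, assuming a sustained fraction `efficiency` of line rate."""
    bits = dataset_tb * 1e12 * 8          # terabytes -> bits
    return bits / (link_gbps * 1e9 * efficiency) / 3600

# Moving a petabyte (1000 TB) at full line rate:
print(round(transfer_hours(1000, 100), 1))  # ~22.2 h at 100 Gb/s
print(round(transfer_hours(1000, 400), 1))  # ~5.6 h at 400 Gb/s
```

In practice sustained efficiency is well below 1.0, which only widens the gap between the link generations.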
1023
14:53:17,980 --> 14:53:24,780
Enrico Fermi Institute: Yes, that's the plan. So I think now we can jump to Dale's presentation, if you're out there, Dale.
1024
14:53:25,670 --> 14:53:31,499
Enrico Fermi Institute: Yes, sounds good. Okay, great. I'm going to stop sharing here and you can start your share.
1025
14:53:35,730 --> 14:53:49,210
Dale Carder: All righty. Thanks for having me here today, and feel free to interrupt; I like this interactive approach a lot more than me just preaching. So I've kind of got an overview of
1026
14:53:49,220 --> 14:53:56,039
Dale Carder: sort of the DOE networking perspective on HPC facilities, Tier-1s,
1027
14:53:56,070 --> 14:53:59,089
Dale Carder: and then we'll get into some cloud stuff, and then
1028
14:53:59,430 --> 14:54:05,269
Dale Carder: then I sort of trail off into where I have more questions than answers, which is, I guess, not surprising,
1029
14:54:05,460 --> 14:54:08,239
Dale Carder: given where some of these conversations have been.
1030
14:54:08,740 --> 14:54:15,859
Dale Carder: So the biggest thing I want to emphasize with respect to not just where we are now. But you know,
1031
14:54:16,190 --> 14:54:35,530
Dale Carder: through sort of the timeline between now and the beginning of High-Luminosity LHC. We had this big process to build ESnet6, and some of the key components included building our physical network into each DOE national lab,
1032
14:54:35,670 --> 14:54:48,760
Dale Carder: and that means our fiber extends in there, with our equipment collocated at the site, with routers that we run there, so we can offer essentially any ESnet service at any national lab at full scale.
1033
14:54:49,880 --> 14:55:08,229
Dale Carder: Also, now ESnet owns the optical equipment, and basically the end-to-end connectivity, so it's extremely cost-effective to upgrade. It's not going out and procuring circuits from vendors, things along that line; we're doing all of our optical engineering in house
1034
14:55:08,240 --> 14:55:14,650
Dale Carder: now, so we can go out and buy modems from any vendor off the shelf and put them on our network after we qualify them.
1035
14:55:14,820 --> 14:55:24,800
Dale Carder: So it's a very different evolution model from the traditional backbone approach of buying circuits and linking things together hop by hop.
1036
14:55:25,670 --> 14:55:27,000
Dale Carder: Um!
1037
14:55:27,070 --> 14:55:40,660
Dale Carder: There was already a little bit shown of sort of where we are at, connectivity-wise, for each of the LCFs and NERSC. Basically we're right now at this precipice of going from
1038
14:55:40,730 --> 14:55:59,839
Dale Carder: N-by-one-hundred-gig connectivity to four-hundred-gig-class connectivity. Everyone's got a slightly different timeline, in large part due to equipment shortages and things of that sort, but generally across the big DOE facilities this is all kind of happening in parallel.
1039
14:55:59,880 --> 14:56:17,669
Dale Carder: Yesterday there was a lot of talk about NERSC being sort of different than the LCFs, which is fair. They're targeting one terabit per second basically into their facility, and that's not through the lab; that's direct to NERSC.
1040
14:56:18,780 --> 14:56:33,380
Dale Carder: Where I think this puts us, and at least where I want to be, is that the limiting factor is going to be at the site. You know, if we can basically show up to the door of Fermilab, or show up to the door of
1041
14:56:33,470 --> 14:56:37,449
Dale Carder: NERSC or wherever, with essentially all-you-can-eat connectivity,
1042
14:56:37,800 --> 14:56:41,620
Dale Carder: it's now onto the Border Router security junk
1043
14:56:41,990 --> 14:56:44,800
Dale Carder: data transfer nodes and storage
1044
14:56:45,030 --> 14:56:49,839
Dale Carder: where the scaling factors are going to be, not necessarily the wide area.
1045
14:56:50,640 --> 14:56:56,309
Dale Carder: So that's that's sort of where where I think we're going to be, at least in the next
1046
14:56:56,520 --> 14:57:01,299
Dale Carder: couple of years. We've got a long life cycle, especially on the optical network that we've built.
1047
14:57:02,380 --> 14:57:05,220
Dale Carder: Are there any questions sort of on this front,
1048
14:57:05,860 --> 14:57:08,820
Dale Carder: and I think we will drift off into cloud stuff.
1049
14:57:10,580 --> 14:57:15,270
Enrico Fermi Institute: So when is the four hundred Gigabit stuff expected to become
1050
14:57:20,030 --> 14:57:37,240
Dale Carder: Economical? It's a funny term, but it's almost more about availability right now: can you buy equipment or not? And in some cases you can actually only buy the newer equipment, because it's the smaller fab sizes that are actually being produced,
1051
14:57:37,250 --> 14:57:51,110
Dale Carder: versus the larger fabs, where you're competing with chips for dishwashers and things like that. So it's sort of this funny point. But in our conversations with,
1052
14:57:51,300 --> 14:58:06,020
Dale Carder: um, I think we're up to, like, sixty or seventy of the Tier-2s, nearly everyone has a plan for the next couple of years; it's either, like, next year or right after that. So we're pretty much right at that point now.
1053
14:58:06,730 --> 14:58:24,369
Dale Carder: A lot of that's driven by, you know, the economics of these major cloud data centers. So if you can buy equipment matching what the industry as a whole is buying, you're going to reap the rewards of that cost-effectiveness.
1054
14:58:25,910 --> 14:58:41,179
Enrico Fermi Institute: Is there a concern that, and I know it doesn't apply for a lot of sites, but for things like firewalls? I know some sites are more concerned about that than others. Are the firewall appliances sort of keeping pace with the
1055
14:58:41,580 --> 14:58:54,109
Dale Carder: i'll say no. I don't think there's truly been a demonstrated track record of that.
1056
14:58:54,420 --> 14:58:56,780
Dale Carder: You know we still see
1057
14:58:57,000 --> 14:58:59,720
Dale Carder: traffic compounding at
1058
14:59:00,380 --> 14:59:03,320
Dale Carder: forty ish annually
1059
14:59:03,380 --> 14:59:22,169
Dale Carder: Those firewalls and middleboxes are typically designed for administrative workloads, where at the end of the day there's only so much data; all you guys sitting on your laptops in the conference room are going to be competing for resources. That's very different, right, from scientific computing.
1060
14:59:22,180 --> 14:59:27,799
Dale Carder: So there's things, you know, ESnet has kind of worked on, on that theme, such as the Science DMZ model for
1061
14:59:27,830 --> 14:59:33,430
Dale Carder: you know how to place resources at a site, how to change the perimeter architecture to better accommodate the
1062
14:59:33,540 --> 14:59:49,789
Dale Carder: data-intensive sciences. So there's opportunities there. But, you know, I still don't see a world where you could cost-effectively deploy an off-the-shelf firewall middlebox.
1063
14:59:50,840 --> 14:59:51,830
Okay,
1064
14:59:52,530 --> 15:00:11,389
Dirk: I'd love to be proven wrong, so please do. Yeah, I had a comment on the last line on this slide, where you said "wide disparity in HPC support for data-centric workflows." I mean, we discussed that a lot yesterday, and where I was
1065
15:00:11,400 --> 15:00:14,199
Dirk: curious was if this actually
1066
15:00:14,240 --> 15:00:17,059
Dirk: has an impact on how the
1067
15:00:17,370 --> 15:00:33,970
Dirk: these HPC facilities approach building up their external connectivity, or if that just doesn't matter — they're still going for full connectivity to the data transfer nodes, at least, even if they don't have the, like, NERSC model where they want to support, like —
1068
15:00:41,960 --> 15:00:53,050
Dale Carder: right. So it's helpful for me to think about this in terms of procurement life cycles, because I think the LCFs are also very much in that world,
1069
15:00:53,060 --> 15:01:08,649
Dale Carder: where you go out and you survey the user community for needs, come up with a list of, you know, use cases that you're going to support, and you take that to DOE as part of CD-0 and say, "Here's the mission need of what we do," go into alternatives analysis, and so on.
1070
15:01:08,970 --> 15:01:16,169
Dale Carder: And then five years later a machine shows up right on the dock, right, and gets installed.
1071
15:01:16,180 --> 15:01:32,919
Dale Carder: So it's really about being ahead of that. And, yes, networks are in the exact same boat. So when we built ESnet6, that's exactly the process we went through, beginning five, six years ago, and here we are — we're going to have, like, our official grand unveiling next month.
1072
15:01:33,890 --> 15:01:52,230
Dale Carder: Uh, so, like, on the ESnet side of the world, we've had these requirements reviews. So many of you here participated in the requirements review for HEP. We're currently doing one now for Basic Energy Sciences, and this goes directly into our, you know, longer-term procurement forecasts, budgets,
1073
15:01:52,240 --> 15:02:01,709
Dale Carder: and things of that nature, so that we don't overbuild, you know, and spend a lot of taxpayer resources, um, way too early,
1074
15:02:01,730 --> 15:02:07,559
Dale Carder: nor get caught on the other end, far, far behind where the needs lie.
1075
15:02:07,650 --> 15:02:13,420
Dale Carder: We essentially solve this on our end through just constant communication, and
1076
15:02:13,610 --> 15:02:21,599
Dale Carder: beating people up, like Andrew Melo, for status as to what's going on, and making sure that we're in lockstep.
1077
15:02:25,500 --> 15:02:41,649
Dale Carder: So for NERSC, like I said yesterday, um — there has been — we're doing a requirements review for Basic, uh, Energy Sciences, and in there will be a case study for LCLS-II, I think, and how that
1078
15:02:41,810 --> 15:02:47,979
Dale Carder: operation at SLAC is going to be integrated with NERSC, because they're talking, again, terabyte
1079
15:02:48,100 --> 15:02:53,969
Dale Carder: uh workflows from the beam line to compute, and then autonomous steering back.
1080
15:02:54,240 --> 15:03:00,560
Dale Carder: So there's things there that could be of relevance to this group, to see how other groups are
1081
15:03:00,760 --> 15:03:02,240
Dale Carder: are sort of handling it.
1082
15:03:07,670 --> 15:03:21,620
Enrico Fermi Institute: So, all right. So — well, yeah, do you have one more? I'll do one more, and if this is covered in another slide, feel free to defer. Uh, but what's the ESnet thinking on caching in the network?
1083
15:03:22,050 --> 15:03:31,530
Dale Carder: Yeah, I'll have just a bullet on that. Yeah, we can kind of open it up there as I get into the more —
1084
15:03:31,950 --> 15:03:36,629
Dale Carder: Yeah. So let's talk about clouds. So there's sort of —
1085
15:03:37,500 --> 15:03:40,180
Dale Carder: the terminology around cloud stuff is, like,
1086
15:03:41,210 --> 15:03:44,880
Dale Carder: amazingly hard to comprehend, because every vendor has their own
1087
15:03:45,130 --> 15:03:54,259
Dale Carder: proprietary language, and they'll use the same words, and none of them are, like, actually descriptive of what's going on. But let's lump it into two bins:
1088
15:03:54,320 --> 15:03:56,470
Dale Carder: Public cloud and Private cloud.
1089
15:03:56,820 --> 15:04:08,730
Dale Carder: Public cloud is, you know, what happens when you just log into an EC2 console and fire up a VM. You're going to get a network that's essentially, you know, meant to be public-facing,
1090
15:04:08,790 --> 15:04:09,970
Dale Carder: um,
1091
15:04:10,500 --> 15:04:17,009
Dale Carder: and, you know, those egress charges we keep hearing about apply, and things of that nature.
1092
15:04:17,210 --> 15:04:18,940
Dale Carder: Private cloud.
1093
15:04:19,090 --> 15:04:33,729
Dale Carder: This is where you would be, um, standing up, you know, multitudes of, um, instances of compute with some private back-end network, and then that private back-end network has some sort of, you know, egress,
1094
15:04:34,030 --> 15:04:51,190
Dale Carder: uh, delivered through a multitude of means. But then it has to connect out to something, right? Right — it's fully self-contained. Uh, so you have to connect back to your home institution, or use some tunneling technology. Uh, optionally, you can bring your own IP addressing. Uh,
1095
15:04:51,480 --> 15:05:10,629
Dale Carder: the typical workloads are administrative computing. So say the University of Chicago wanted to put the HR system in the cloud and keep it on the University of Chicago network, as it's HR data — this is the technology they would use. And, you know, this should be — like, I should put this in bold — but, like,
1096
15:05:10,650 --> 15:05:18,269
Dale Carder: it's very expensive. And we're talking about data rates, you know, commensurate with administrative computing, not research computing.
1097
15:05:18,660 --> 15:05:34,160
Dale Carder: And that's why you see software routers, um, software appliances, doing these VPNs. So they've come up with, um — in addition to just multiple ways to extract money from you — different ways to work around these limitations. So
1098
15:05:34,250 --> 15:05:43,919
Dale Carder: if you're beyond the scale of what you can get away with with, you know, using a software-based router and software-based, you know, VPNing of traffic back to an institution,
1099
15:05:44,200 --> 15:05:58,230
Dale Carder: there's, uh, dedicated interconnects — these are, like, essentially charged-by-the-hour connections. That's why I tried to put this like going to a restaurant: this is the four-dollar-sign, you know, menu option.
1100
15:05:58,400 --> 15:06:13,059
Dale Carder: Um, you have Cloud Exchange, which was sort of — where you'd have, like, this, uh, intermediate broker managing, like, the physical infrastructure for you. We have some of these today on ESnet; we're working to deprecate them because they're at the three-dollar-sign level,
1101
15:06:13,930 --> 15:06:24,750
Dale Carder: and we're replacing them with this partner interconnection model, which is where you go out and you procure. And by you I mean like Yes, net goes out and procures a middle man
1102
15:06:24,860 --> 15:06:33,130
Dale Carder: to handle this sort of, like, interconnection, and get away from the hourly port charges, um, to the various entities,
1103
15:06:33,290 --> 15:06:39,740
Dale Carder: and throw some, you know, virtualization on top of that, and come out at only the two-dollar-sign approach.
1104
15:06:40,220 --> 15:06:42,260
Dale Carder: But again, these are still at
1105
15:06:42,400 --> 15:06:55,590
Dale Carder: humble data rates. Um! To put actual money on here — it's nearly impossible to, uh, you know, figure out what these things cost. Like, you need a, you know, a used-car salesman to help you
1106
15:06:55,690 --> 15:07:09,500
Dale Carder: uh, figure this out. So, putting that into, like, where are we today, um, connectivity-wise: so in the public cloud realm again, uh, if you were to stand up, you know,
1107
15:07:09,640 --> 15:07:24,150
Dale Carder: you know, random sets of machines, this is sort of the connectivity we have, which is, you know, three hundred-gig connections in major markets for Google, six connections to Oracle, five to Amazon, five to Microsoft.
1108
15:07:24,160 --> 15:07:35,560
Dale Carder: And these are basically there and ready to go, um — such as was mentioned earlier on, like, Fermilab being able to take advantage of the Google connectivity, um,
1109
15:07:35,840 --> 15:07:42,449
Dale Carder: on a couple of occasions now — most recently, I think, last October, when there was that inference training run.
1110
15:07:43,390 --> 15:07:51,479
Dale Carder: Um, these are very, very cost-effective — such that we pay for these essentially out of the operating budget.
1111
15:07:51,860 --> 15:08:07,890
Dale Carder: So this is just our cost of doing business, shared across all of DOE. It's not a big problem, because — much like we built in to each of the national labs — we built ESnet6 into the major commercial facilities. So we're there.
1112
15:08:07,900 --> 15:08:15,040
Dale Carder: So a lot of these connections are just a jumper across the building, you know, that kind of thing — from our network to that network there. Go ahead.
1113
15:08:17,290 --> 15:08:23,250
Dirk: But this — basically, this doesn't give you a cost advantage, it just gives you capabilities, right?
1114
15:08:23,260 --> 15:08:46,970
Dirk: Yep, exactly. — But this — especially with Google, this matches very well with their, uh, their flat, you know, subscription model. — Yeah. I mean, yeah, you still have the normal cost. So if you go just on demand, you just pay the normal egress costs; you just have the fast data connection there so that you can actually run your workflows. And then with the subscription, okay, you get rid of egress and you can, of course, use it fully. Okay, thanks.
1115
15:08:46,980 --> 15:08:54,459
Dale Carder: Yeah, exactly. I think Oracle also may waive egress fees. I forget who is using that, uh, in DOE.
1116
15:08:57,180 --> 15:09:14,340
Enrico Fermi Institute: So, quickly, though — to take advantage of this, if I were to log on to EC2, and I've landed in, I guess, the right availability zone — I don't need to do anything special? If I'm moving data from somewhere in Amazon to somewhere connected to ESnet6,
1117
15:09:14,350 --> 15:09:18,790
Enrico Fermi Institute: uh, to me as the quote-unquote user, I don't have to do anything special to —
1118
15:09:19,130 --> 15:09:32,770
Dale Carder: Right. And this whole slide probably applies both to ESnet and to Internet2. I think we're probably nearly identical in capabilities in this regard, because it's just easy to scale up as usage
1119
15:09:32,950 --> 15:09:38,519
Dale Carder: is in place. One thing I'll point out, though, you know: in these direct connections to these peers,
1120
15:09:38,560 --> 15:09:42,939
Dale Carder: there is, like, human-to-human-level negotiation to get these into place.
1121
15:09:42,970 --> 15:09:48,749
Dale Carder: So, for example, it took months to connect to Google. You know, they said, "Well, how much are you going to use?" and we're like, "I don't know — all of it,"
1122
15:09:49,050 --> 15:09:57,379
Dale Carder: right? They were like, "Yeah, whatever." And then what did we do? We used all of, you know, all of their GPUs, for example, because we can.
1123
15:09:57,410 --> 15:10:10,549
Dale Carder: Um, these providers are much more used to, like, diurnal traffic flows, uh, like you would see with, you know, commercial users during the day and, you know, residential users at night. Um, so there's, like —
1124
15:10:10,700 --> 15:10:15,559
Dale Carder: to get these in place does require some negotiation and some long-range planning,
1125
15:10:16,170 --> 15:10:19,010
Dale Carder: because we have to talk them into it and prove we're going to use it
1126
15:10:21,010 --> 15:10:23,160
Dale Carder: Paolo, I see you've got your hand up.
1127
15:10:23,270 --> 15:10:38,420
Paolo Calafiura (he): Yeah, I — it was kind of a question already asked, and then another one. So I believe that there was also some, uh, peering agreement — am I right? — if we use your boxes. But
1128
15:10:38,430 --> 15:11:05,940
Paolo Calafiura (he): some discounts you guys set up? If I recall correctly, the Amazon one is something more like: if you use X amount of compute, some percentage of that can be — yeah, yeah, exactly, something like that. And then, just out of curiosity: why do you have the most boxes to Oracle? Is that just because it happened and they were easy to deal with, or because there is a use for it long-term?
1129
15:11:05,950 --> 15:11:25,680
Dale Carder: Um, there's almost certainly someone going to use it — like, DOE is very, very big. So between, uh, the Office of Science and NNSA and all the other stuff going on — um, there's also, uh, you know, the DOE, uh, federal network itself, which is now an overlay on ESnet.
1130
15:11:25,690 --> 15:11:30,750
Dale Carder: That's the nice thing about it — once you hit the scale, we can kind of share the economics of this.
1131
15:11:32,400 --> 15:11:52,140
Dale Carder: And then, quickly, I'll go over the private cloud interconnects. Uh, so this is where we have — we're putting into place, actually, as we speak, uh, terabit connectivity to a third party called PacketFabric, and then they go through and punch, uh, physical connectivity into each of the vendors
1132
15:11:52,150 --> 15:12:07,749
Dale Carder: for the private cloud hosting. That will replace things like we had previously with, uh, the Cloud Exchange product. So again, it's a bit more, um, you know, targeting administrative workloads. But
1133
15:12:07,790 --> 15:12:13,040
Dale Carder: as we get into talking about LHCONE, maybe it's a model that could be used there, too. I don't know.
1134
15:12:13,120 --> 15:12:14,430
Dale Carder: uh Dirk,
1135
15:12:17,590 --> 15:12:21,010
Dirk: I think Fernando was first. If he wants to go.
1136
15:12:24,700 --> 15:12:31,579
Dirk: No — now he has lowered the hand. — I just had a quick question. So — sorry, sorry, I was on mute,
1137
15:12:31,590 --> 15:12:48,210
Fernando Harald Barreiro Megino: and I still didn't get over the public cloud section. So, if I'm, uh — do I need to be on Google in the availability zone — the region — Seattle, Chicago, or NYC in order to
1138
15:12:48,450 --> 15:12:54,120
Fernando Harald Barreiro Megino: have my transfers go through ESnet?
1139
15:12:54,640 --> 15:13:11,879
Dale Carder: Every vendor is different. With Google, I think they announce — or, they will haul traffic regardless of where it ingresses or egresses their network. Amazon is the exact opposite, where you have to send the traffic to the exact zone. So —
1140
15:13:11,890 --> 15:13:18,989
Dale Carder: all these systems are proprietary in that regard, and you unfortunately kind of have to know in advance what you're walking into.
1141
15:13:24,480 --> 15:13:31,589
Fernando Harald Barreiro Megino: And the transit to — I mean, SWT2, or anywhere at some university, and
1142
15:13:31,630 --> 15:13:34,459
Fernando Harald Barreiro Megino: in the US — that will
1143
15:13:35,020 --> 15:13:40,880
Fernando Harald Barreiro Megino: go through the — I mean, through the normal Internet; it will not end up on ESnet, right?
1144
15:13:41,730 --> 15:13:51,500
Dale Carder: Right. So for Google, that'd be the case. For Amazon — where ESnet does not peer with Amazon in Europe — we would probably never see the traffic until it shows up through whatever
1145
15:13:51,750 --> 15:13:53,490
Dale Carder: other path exists.
1146
15:13:53,590 --> 15:13:54,440
Okay,
1147
15:13:54,860 --> 15:13:57,260
Fernando Harald Barreiro Megino: Okay, thanks.
1148
15:13:58,290 --> 15:14:08,890
Dirk: Okay. And I had a question yesterday — that was when we talked briefly about Lancium. I remember from talking with them that they said they had plans to peer with —
1149
15:14:08,900 --> 15:14:24,610
Dirk: I think it was ESnet. Are you aware of anything? I mean, I think they're still building the data center, so I'm not sure at what stage they are with that. — Our general peering policy is relatively wide open, as long as we can justify it,
1150
15:14:24,720 --> 15:14:34,139
Dale Carder: uh, so any new market entrants — it should not be a barrier on the network side, as long as they show up at essentially any
1151
15:14:34,260 --> 15:14:44,270
Dale Carder: major, uh, co-location facility where networks come and meet together. So, for example, we're in Houston, we're in Dallas, we're in El Paso — I mean, kind of their neck of the woods.
1152
15:14:46,180 --> 15:14:48,409
Dale Carder: So that question is very easy.
1153
15:14:53,330 --> 15:14:57,830
Dale Carder: All right, all right. Um, this is sort of more like the —
1154
15:14:58,240 --> 15:15:00,100
Dale Carder: I dumped all the other stuff here.
1155
15:15:00,180 --> 15:15:07,480
Dale Carder: Um, so, some other things ESnet has, um, that are sort of just worth having on your laundry list of things to know exist.
1156
15:15:07,620 --> 15:15:10,210
Dale Carder: Um, one is, you know,
1157
15:15:10,530 --> 15:15:29,139
Dale Carder: APIs and, you know, dynamic requesting of resources is something that ESnet has, um, long since supported for layer-two circuits, including, uh, bandwidth scheduling, uh, on demand, um, and prioritization. That is how the LHCOPN
1158
15:15:29,150 --> 15:15:32,900
Dale Carder: uh circuits are instantiated between it's your zero and the tier one's
1159
15:15:33,160 --> 15:15:42,740
Dale Carder: Um, also sort of in flight is, um, dynamic layer-three instantiation. It works internally at ESnet — we actually used it for LSST,
1160
15:15:42,840 --> 15:15:47,320
Dale Carder: uh, between SLAC and, uh, the South American networks.
1161
15:15:47,570 --> 15:16:07,369
Dale Carder: Um, it's completely conceivable to open that up also, and that could be used as a way, if you wanted to dynamically, uh, you know, acquire cloud resources, to hit the API endpoint and fire it up. Um! So these things are very much, like, near reality, should a use case justify their development.
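[Editor's note: the dynamic-provisioning idea described here — hit an API endpoint, get a scheduled circuit — can be sketched roughly as below. This is a hypothetical request shape for illustration only; the endpoint names, field names, and `build_circuit_request` helper are invented, not ESnet's actual OSCARS/SENSE API.]

```python
import json
from datetime import datetime, timedelta, timezone

def build_circuit_request(a_end: str, z_end: str, vlan: int,
                          mbps: int, hours: int) -> dict:
    """Assemble a request body for a scheduled point-to-point
    layer-2 circuit between two endpoints (illustrative shape)."""
    start = datetime.now(timezone.utc)
    return {
        "endpoints": [
            {"port": a_end, "vlan": vlan},
            {"port": z_end, "vlan": vlan},
        ],
        "bandwidth_mbps": mbps,  # guaranteed rate to schedule
        "start": start.isoformat(),
        "end": (start + timedelta(hours=hours)).isoformat(),
    }

# A caller would POST this body to the provisioning endpoint.
req = build_circuit_request("site-a:xe-0/0/1", "cloud-gw:xe-1/2/0",
                            vlan=523, mbps=10_000, hours=6)
print(json.dumps(req, indent=2))
```

The point is that acquiring network capacity becomes one API call alongside the call that acquires the cloud compute itself.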
1162
15:16:07,500 --> 15:16:19,190
Dale Carder: Um, there's an R&D project underway with, um, Rucio, and integration with our framework called SENSE, which is, again, more on, uh, you know, dynamic network path provisioning
1163
15:16:19,610 --> 15:16:24,199
Dale Carder: across the
1164
15:16:24,920 --> 15:16:26,660
Dale Carder: what you call it on
1165
15:16:27,930 --> 15:16:40,060
Dale Carder: uh potential sort of sort of kicking off now. Um, where um sort of like the nurse super facility concept is sort of making it, you know the logical next sleep
1166
15:16:40,620 --> 15:16:55,390
Dale Carder: uh internal to Yes, and we now have um Fpga experience in house. So we have uh been working on some projects where we're using Fpga's um to accelerate uh different
1167
15:16:55,450 --> 15:17:11,650
Dale Carder: the sort of like used cases we've seen um from like uh triggers to compute uh and load dynamically, load bouncing and hardware, or working on something like that for J. Lab. I think there is also a similar effort underway between uh
1168
15:17:11,760 --> 15:17:13,539
Dale Carder: Als and nurse
1169
15:17:14,050 --> 15:17:18,290
Dale Carder: um. In addition, those Fpgas can be used to, you know,
1170
15:17:18,420 --> 15:17:28,050
Dale Carder: in in my like crystal ball out like we think about. If we hit. You know the scaling limits of cpus. It also probably means we'll end up hitting the scaling limits of Tcp.
1171
15:17:28,150 --> 15:17:39,700
Dale Carder: Um. Someone smarter than me is probably already figured out when that exists. But, uh, we're we're sort of ready for that era with the ability and yes net code running on Fpga's today
1172
15:17:40,630 --> 15:17:48,639
Dale Carder: on the more operational side of the house. Um, we've got an R. And D project underway and deployment of uh packet marking
1173
15:17:48,650 --> 15:18:07,750
Dale Carder: uh. So using annotations in like the Ipd six packet header to identify uh what workload is running, and then reporting that back out from like an accounting perspective of you know what science, domain, and activity is on a particular link to something that's been. That'll be pretty useful for for
1174
15:18:07,760 --> 15:18:12,620
Dale Carder: planning, you know, capacity, planning, traffic, engineering, those sort of use cases.
1175
15:18:13,490 --> 15:18:27,130
Dale Carder: And then here's my catch all for, uh, you know X cash. So uh I I think it's certainly a a promising future of more integrated. Uh, you know,
1176
15:18:27,230 --> 15:18:39,629
Dale Carder: caching or even bigger picture storage on or in the network, and to better use the resources available to us, for example, latency hiding as being sort of an easier use case.
1177
15:18:39,640 --> 15:18:49,580
Dale Carder: And then uh, I think there's currently cashes There's one in California. I don't know the status. There might be one in Chicago, and also one plan for Boston.
1178
15:18:50,090 --> 15:19:01,929
Dale Carder: But it seems to me, you know, my engineer approach, for, like Guy, who doesn't make the decisions is that seems pretty straightforward, and something we should continue to work on.
1179
15:19:02,990 --> 15:19:21,869
Enrico Fermi Institute: Yeah, go ahead. — No, I was gonna say — I'm sorry, I thought you were gonna go on to the next slide. I was going to say I'm just rambling. Um, could you talk a bit more about this, uh, the layer-three VPN instantiation? So who would this be spanning, um —
1180
15:19:21,890 --> 15:19:41,860
Dale Carder: their their vpn from from from whom? To whom? I guess you you know I mean how I build things. It's the support from anywhere to anywhere. So it's It's nebulous. Um, because it's you know it's a generic framework. So the idea being you've got site eight and site Z. They Wanna and and site F.
1181
15:19:42,270 --> 15:19:51,650
Dale Carder: They could create a private network overlay on the for that activities. Um, traditionally, that was something very hard to do. You'd have to go around, and, you know, signal up circuits, or
1182
15:19:51,700 --> 15:19:59,089
Dale Carder: do all this work Now it would look much more like, hey? Here's a vlan and can actually router into it, and it will just get to the other side,
1183
15:19:59,120 --> 15:20:09,500
Dale Carder: and it's completely private. It's the same technology that cloud providers are using on the back end for their virtual private networks. So if you guys, they're using the same kind of thing.
1184
15:20:09,900 --> 15:20:17,080
Enrico Fermi Institute: So, basically, I could hit some API on, on the ESnet side, and
1185
15:20:17,090 --> 15:20:42,269
Enrico Fermi Institute: you would say, "Okay, you connect to VLAN number five hundred twenty-three, and the other one connects to, I don't know, six hundred seventy-two," and that VLAN, on your end, is gonna — um, the VLANs are, you know, tunneled together? You handle stitching together the layer-two circuits, or whatever it takes to get from point to point? — Yeah, or even layer-three circuits. So you've got full resiliency within the continental US, and that kind of thing.
1186
15:20:42,290 --> 15:20:45,820
Dale Carder: So yeah, it's pretty promising. Uh,
1187
15:20:45,940 --> 15:21:03,750
Dale Carder: I think. Just need more exploration of what the used cases are there like. We built it for ourselves, but there's nothing preventing that um and sort of how it was designed to to take the setup. One of these circuits takes takes longer to fill out the form in our database,
1188
15:21:03,760 --> 15:21:21,099
Enrico Fermi Institute: Gotcha. And does this — you said anywhere to anywhere — so I could potentially set this up, uh, you know, at Vanderbilt, and have the other end be a cloud provider? — Yeah, that's what I'm thinking could be a popular use case,
1189
15:21:21,110 --> 15:21:25,710
Dale Carder: right? And maybe you even want to have a second cloud provider. I mean, that's totally doable.
1190
15:21:25,820 --> 15:21:32,929
Enrico Fermi Institute: Okay. Yeah, I can definitely think of a few interesting things you could do with that.
1191
15:21:33,340 --> 15:21:38,839
Dale Carder: It's something that, again — it's sort of like, let's plant the seed of, you know, a capability that exists,
1192
15:21:39,060 --> 15:21:41,780
Enrico Fermi Institute: and see if there's a a good use for it.
1193
15:21:42,680 --> 15:21:44,940
Enrico Fermi Institute: Oh, thank you. Yeah,
1194
15:21:46,080 --> 15:21:52,379
Dale Carder: Okay. And then here's where we drift off from the known to the less known.
1195
15:21:52,410 --> 15:22:02,250
Dale Carder: So this — in thinking about sort of these facilities as part of a greater ecosystem — we've covered the DOE space well.
1196
15:22:02,370 --> 15:22:16,099
Dale Carder: Now, if you think about the NSF HPC sites in particular, it's even more disparate as to their connectivity and capabilities. So some sites — like, off the top of my head,
1197
15:22:16,860 --> 15:22:27,859
Dale Carder: uh San Diego um, and do extremely well connected like wouldn't wouldn't worry about them. Um, because typically they have like they own their infrastructure. Um!
1198
15:22:27,880 --> 15:22:33,060
Dale Carder: Ncsa is another one where moodles of network connect to it exists,
1199
15:22:33,270 --> 15:22:44,040
Dale Carder: but then there's other centers i'll, you know, unfortunately, like I think, is in the scenario, where, like their machine, is like often some business park outside of town or
1200
15:22:44,290 --> 15:22:49,719
Dale Carder: and it and there's not necessarily good connectivity to the for a data centric workflow.
1201
15:22:49,820 --> 15:22:56,510
Dale Carder: So if you're thinking about running more on Nsf. Hpc. Facilities, you need to have a facilitation with
1202
15:22:56,690 --> 15:23:01,789
Dale Carder: the sites you're thinking about to answer some key questions of. Can you get your data in and out
1203
15:23:02,030 --> 15:23:04,070
Dale Carder: uh in a production fashion?
1204
15:23:04,380 --> 15:23:08,250
Dale Carder: Because it's There's a huge disparity between sites
1205
15:23:09,720 --> 15:23:13,179
Dale Carder: now on the Us. Side. Um,
1206
15:23:13,510 --> 15:23:24,780
Dale Carder: We covered some of that um just before I started. But what of note? Yes, and that is talking to every single Us tier, two site basically in preparation for high luminosity.
1207
15:23:25,070 --> 15:23:32,899
Dale Carder: As such we were sort of like getting a good view as to where the the universities are, with their regional networks,
1208
15:23:33,370 --> 15:23:41,009
Dale Carder: and in general, I think, with enough prior planning which was our goal. The outlook continues to be good.
1209
15:23:41,080 --> 15:23:44,200
Dale Carder: Um! But we need to keep that facilitation game up
1210
15:23:44,300 --> 15:23:56,660
Dale Carder: uh and make sure that you know for especially universities that have one or two intermediate networks between them and yes, another internet too, that everything upgrades and lockstep, or we can't connect these things together,
1211
15:23:57,680 --> 15:24:00,260
Dale Carder: so that that present looks
1212
15:24:00,450 --> 15:24:18,510
Dale Carder: good, and the key to making this work from my perspective is the data challenges a thing where we can point to and say By this date it has to work as follows: The The data challenges are are going to be the the forcing function that the the community uses the for internal justification. The,
1213
15:24:18,520 --> 15:24:30,059
Dale Carder: you know, show their administration like the you know, the the pro or whatever like. Hey, We do need this stuff, and and here's where we're on. We need it by, and that that program is finally important.
1214
15:24:32,030 --> 15:24:38,670
Dale Carder: Now, on to the the perhaps more questioning stuff on my part, which is
1215
15:24:38,700 --> 15:24:46,129
Dale Carder: this community has a network called Lhc. One which is sort of called a whole nother Internet connecting just the
1216
15:24:46,150 --> 15:24:49,229
Dale Carder: resources together that exclusively
1217
15:24:49,350 --> 15:24:53,840
Dale Carder: you know work on these large-scale projects for Lhc
1218
15:24:54,160 --> 15:25:08,620
Dale Carder: So in the Us. You've got um Us Cms and us Atlas sites the tier one's in the tier, two centers connected to Lc. One, and then yes, net has transatlantic connectivity where we connect to our our peer networks in the Eu
1219
15:25:09,200 --> 15:25:13,050
Dale Carder: again to the major tier, one into your two centers.
1220
15:25:13,580 --> 15:25:16,670
Dale Carder: On those networks there is, you know,
1221
15:25:17,460 --> 15:25:28,839
Dale Carder: for better or worse. Ip addresses are used as authorization tokens for what traffic can go on to that network, because that network has an acceptable use policy defining what can and can't be on it.
1222
15:25:29,270 --> 15:25:34,900
Dale Carder: Uh, namely, it's exclusive — it's for the exclusive use of LHC traffic.
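[Editor's note: the "IP address as authorization token" model described here can be sketched with the Python standard library — admission to the overlay is just a source-prefix check. The prefixes below are RFC documentation ranges for illustration, not real LHCONE announcements.]

```python
import ipaddress

# Illustrative prefix set announced into the overlay network; real
# LHCONE prefix lists are maintained by the participating sites.
ANNOUNCED = [ipaddress.ip_network(p) for p in
             ("192.0.2.0/24", "2001:db8:100::/48")]

def allowed_on_overlay(src: str) -> bool:
    """A flow is admissible only if its source address falls inside
    a prefix registered to the overlay — the 'IP as authorization
    token' model, with all its limits for dynamic resources."""
    addr = ipaddress.ip_address(src)
    return any(addr in net for net in ANNOUNCED)

print(allowed_on_overlay("192.0.2.57"))   # True: inside a registered prefix
print(allowed_on_overlay("203.0.113.9"))  # False: generic multi-science node
```

This also makes the problem Dale raises concrete: a multi-science cluster or an ephemeral cloud VM has a source address outside any registered prefix, so its LHC traffic fails the check even when the workload itself satisfies the AUP.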
1223
15:25:35,250 --> 15:25:46,019
Dale Carder: Now, in the case where you've got a dedicated facility and that's all it does, or maybe you have dedicated DTN machines and all they do is, um, you know, traffic that's LHC-related,
1224
15:25:46,070 --> 15:25:54,429
Dale Carder: It's pretty straightforward when you start thinking about cloud resources, or even some of the bigger clusters, even seen like an open science grid.
1225
15:25:54,440 --> 15:26:08,900
Dale Carder: These are multi-science uh compute nodes. And we talked to our our peers at Brooklyn. This is already happening there where they they have cluster, can run any job but this restriction of what traffic can go over. Lhc. One
1226
15:26:09,470 --> 15:26:13,799
Dale Carder: a limiting factor, because now the source Ip address of the Node banners
1227
15:26:13,910 --> 15:26:16,910
Dale Carder: and trying to adhere to the aup,
1228
15:26:16,990 --> 15:26:18,649
Dale Carder: Is there a problem?
1229
15:26:19,940 --> 15:26:25,579
Dale Carder: So we figured this out essentially to the degree of very static resources. Right? This works
1230
15:26:25,710 --> 15:26:29,680
Dale Carder: very well for the tier ones and tier two is especially in the Us.
1231
15:26:29,890 --> 15:26:40,300
Dale Carder: But it does not to me have a clear understanding of how you would integrate external resources into this. Um!
1232
15:26:40,580 --> 15:26:50,020
Dale Carder: It's an open discussion uh at this point. It's not like I'm here with any answer. I'm just saying like I think we can all agree that something to be worked on
1233
15:26:50,830 --> 15:27:07,149
Dale Carder: um that has big and public implications, particularly for the transatlantic traffic. So that's why I had this on. Here is yes, that currently has five, one hundred. You pass across the Atlantic. We're bringing up uh two additional four hundred gig um paths.
1234
15:27:07,210 --> 15:27:08,380
Dale Carder: Um
1235
15:27:08,730 --> 15:27:12,250
Dale Carder: sometime next year. Hopefully, these are like very, very.
1236
15:27:12,320 --> 15:27:21,510
Dale Carder: It intensive builds um to get. You know we're not just buying circuits. We're buying spectrum on undersea cables and integrating it into our network
1237
15:27:21,930 --> 15:27:25,809
Dale Carder: so and then the contracting side of this is
1238
15:27:26,100 --> 15:27:39,229
Dale Carder: mind-bogglingly complex and these are multi year procurements with nda's in place. So we have additional links that we're going to come in after these two by four hundred. So we're trying to get on additional cables with additional spectrum.
1239
15:27:39,240 --> 15:27:46,920
Dale Carder: All of this is very easy for us to integrate into LHCOne. It's very easy and straightforward for us to integrate into the DOE ecosystem.
1240
15:27:47,680 --> 15:27:53,679
Dale Carder: Again, how would you use that with, like, third-party cloud sites?
1241
15:27:54,780 --> 15:27:58,399
Dale Carder: That's open for exploration; it's not clear.
1242
15:27:59,460 --> 15:28:16,819
Enrico Fermi Institute: So is it fair to say that it seems like all the physical capability is kind of there when it comes to talking to clouds, but doing things like getting a block of IPs and announcing those to LHCOne is challenging with the public clouds?
1243
15:28:16,880 --> 15:28:17,920
Dale Carder: Yup,
1244
15:28:18,030 --> 15:28:19,470
Dale Carder: um!
1245
15:28:19,580 --> 15:28:27,449
Dale Carder: Whereas maybe a more straightforward topology is actually something more like HEPCloud, where,
1246
15:28:27,590 --> 15:28:29,010
Dale Carder: you know, from
1247
15:28:29,040 --> 15:28:32,810
Dale Carder: the network's perspective, it's Fermilab on either end.
1248
15:28:32,960 --> 15:28:36,210
Dale Carder: It's Fermilab stuff in the cloud, Fermilab stuff at home,
1249
15:28:36,310 --> 15:28:37,490
Dale Carder: and then it
1250
15:28:37,660 --> 15:28:39,219
Dale Carder: and branch out
1251
15:28:39,710 --> 15:28:59,000
Enrico Fermi Institute: that maybe is a more workable model, at least for DOE. So you're saying, for transatlantic traffic, Fermilab is kind of the responsible party for making sure that they're agreeing with the AUP and that their traffic going across the transatlantic link is LHC traffic, and
1252
15:28:59,010 --> 15:29:02,869
Enrico Fermi Institute: the peering happens between the cloud and Fermilab?
1253
15:29:03,190 --> 15:29:11,249
Dale Carder: Yes. So the AUP essentially is such that any DOE resource can do whatever they want,
1254
15:29:11,340 --> 15:29:24,840
Dale Carder: including talking to universities. But at present the AUP doesn't straightforwardly allow a Tier 2 to use cloud resources that would be brokered by ESnet as the middleman,
1255
15:29:25,600 --> 15:29:31,230
Dale Carder: to use a cloud resource and expect it to use all this transatlantic capability
1256
15:29:31,390 --> 15:29:33,569
Dale Carder: that DOE has invested in.
1257
15:29:36,890 --> 15:29:48,730
Dirk: Yeah, I wanted to comment on that. I think you already said that that's part of the strategy that CMS is going with, with HEPCloud, that we
1258
15:29:48,830 --> 15:29:51,169
Dirk: we kind of keep it contained.
1259
15:29:51,180 --> 15:30:08,940
Dirk: We haven't done anything with large cloud use in a while, nothing like the Amazon test and the Google test five, six years ago. But even then, I think we only targeted regions, the resources, in the US, so that the data traffic,
1260
15:30:09,170 --> 15:30:17,359
Dirk: the data traffic was contained in the US, mostly between Fermilab and these external resources, and then any kind of
1261
15:30:17,370 --> 15:30:31,629
Dirk: output is transferred over the transatlantic links somewhere else, to a European site. That, then, is an independent step that comes after, and it can go through the LHCOne network because it originates at Fermilab at that point.
1262
15:30:31,860 --> 15:30:48,299
Dirk: And the same way for the HPC integration: the way we integrate these HPC resources is that they're connected to Fermilab. Everything stays together, basically, within the US. And,
1263
15:30:48,550 --> 15:30:56,320
Dirk: I don't know. I mean, Fernando, if CMS had a cloud contract and they would want to do a run
1264
15:30:56,390 --> 15:31:05,890
Dirk: where they basically use all the regions in the world together, then obviously it becomes a problem, because you're talking about overlaying
1265
15:31:06,820 --> 15:31:14,690
Dirk: the global cloud resource mix on top of a somewhat partitioned network infrastructure.
1266
15:31:17,700 --> 15:31:20,260
Dirk: Fernando: What regions are you using right now?
1267
15:31:20,630 --> 15:31:26,110
Dirk: Okay. So it's all Europe, okay.
1268
15:31:28,290 --> 15:31:32,210
Dale Carder: and just domestic to the Us. Um,
1269
15:31:32,900 --> 15:31:35,930
Dale Carder: you know, the universities, like the Tier 2 sites,
1270
15:31:36,070 --> 15:31:42,160
Dale Carder: have to a large degree separated their LHC traffic from the rest of their institution's traffic.
1271
15:31:42,440 --> 15:31:43,320
If
1272
15:31:43,420 --> 15:31:48,399
Dale Carder: those lines were to get blurred, that could have an impact
1273
15:31:48,560 --> 15:31:55,439
Dale Carder: on the universities; you can imagine scientific workloads overwhelming, you know, the cat videos and streaming lectures,
1274
15:31:55,630 --> 15:32:03,819
Dale Carder: right? So it's something to be quite mindful of, how the current ecosystem is built, and if you wanted to morph it,
1275
15:32:04,130 --> 15:32:07,150
Dale Carder: the communication necessary to do so,
1276
15:32:15,340 --> 15:32:18,480
Dale Carder: So that's what I had. I'm happy to
1277
15:32:18,540 --> 15:32:21,630
Dale Carder: answer more questions, or even just
1278
15:32:22,990 --> 15:32:29,649
Enrico Fermi Institute: I had a small question. You mentioned that the connectivity to NSF
1279
15:32:29,710 --> 15:32:53,440
Enrico Fermi Institute: sites is, I guess, spotty, maybe. Yeah, I noticed you didn't put that in the slide, but I can read between the lines. So, you know, there's a facility that's been built, or is being built, outside of Boston, some acronym, but it's like a green data center type thing, that all of the Boston area
1280
15:32:53,450 --> 15:33:13,260
Enrico Fermi Institute: uses, and something that both CMS and, I know, ALICE as well have some large storage in, some large tape library that we've each bought some part into. Is this on the end of the better connected?
1281
15:33:13,630 --> 15:33:21,159
Dale Carder: Yeah, it benefits from that. You know, it's basically on-network for MIT.
1282
15:33:21,440 --> 15:33:35,470
Dale Carder: So they're facilitating a lot of that. They're even going to be facilitating, I think in the interim, the connectivity for NET2, which is the ATLAS node there, right?
1283
15:33:35,830 --> 15:33:50,070
Dale Carder: So I don't know if there's anyone from MIT on the call here, but I think the majority of their stuff is at Bates Lab; it's not at MGHPCC. But NET2 does have their new infrastructure, and their existing infrastructure will be at MGHPCC.
1284
15:33:50,720 --> 15:33:53,170
Dale Carder: And right they have some,
1285
15:33:53,690 --> 15:33:58,469
Dale Carder: you know, magic storage back end that, to my understanding, they're gonna leverage for that.
1286
15:33:58,920 --> 15:34:17,979
Enrico Fermi Institute: I think, from talking to one of the folks, they have a very large IBM tape library with GPFS in front.
1287
15:34:20,850 --> 15:34:23,289
Dale Carder: So we've got another question, hand up, David.
1288
15:34:24,860 --> 15:34:37,569
David Southwick: Hi, thanks. Maybe this is a naive question, but if you've got, in the current scenario, traffic, let's say, tunneling through Fermilab, and you're wanting to add
1289
15:34:38,030 --> 15:34:44,399
David Southwick: whatever cloud providers, and they're all at two hundred, four hundred gigabit,
1290
15:34:45,120 --> 15:34:55,080
David Southwick: you get a bottleneck when you do that?
1291
15:34:55,480 --> 15:34:59,770
Dale Carder: right? So that sort of architecture is fine to a point.
1292
15:35:02,510 --> 15:35:05,119
David Southwick: Okay, thanks. I think I understand.
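David's concern is simple aggregation arithmetic; a sketch with purely illustrative numbers (the link capacities below are assumptions for the example, not actual ESnet or site figures):

```python
# Illustrative hub-and-spoke bottleneck check: if several cloud providers
# each peer at high speed but all traffic funnels through one lab's
# uplink, the uplink caps the aggregate throughput.

def aggregate_demand_gbps(provider_links_gbps):
    """Total traffic the cloud side could push at once."""
    return sum(provider_links_gbps)

def bottleneck_gbps(provider_links_gbps, hub_uplink_gbps):
    """Achievable aggregate throughput when everything transits the hub."""
    return min(aggregate_demand_gbps(provider_links_gbps), hub_uplink_gbps)

# Three hypothetical providers peering at 200-400 Gbit/s each,
# funneled through a hypothetical 400 Gbit/s site uplink.
providers = [200, 400, 400]
uplink = 400

demand = aggregate_demand_gbps(providers)    # 1000 Gbit/s offered
usable = bottleneck_gbps(providers, uplink)  # capped at 400 Gbit/s
```

The offered load can exceed the hub uplink more than twofold in this made-up case, which is the sense in which the architecture is "fine to a point."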
1293
15:35:05,180 --> 15:35:08,980
Dirk: Maybe to say something: what we did with
1294
15:35:09,510 --> 15:35:17,119
Dirk: the HEPCloud integration, it's not so much tunneling through Fermilab; it's that you basically keep the problem set contained to
1295
15:35:17,130 --> 15:35:36,849
Dirk: Fermilab plus cloud. And then later, completely asynchronously from the first one, there's how Fermilab integrates with the rest of the LHC infrastructure. So you kind of tie it together at the storage level. Basically, you move some data to Fermilab, and then, independently of that, once that data actually sits there,
1296
15:35:36,860 --> 15:35:49,760
Dirk: then you can schedule work on that data that can run on cloud sites, and the network traffic to get that data to the cloud side runs from Fermilab, basically. So they're independent steps. But of course, eventually,
1297
15:35:50,210 --> 15:36:07,170
Dirk: just because you removed the timing and it's not an immediate tunnel, you still have to keep these resources fed on the cloud and also on the HPC side. So eventually, as the integrated capacity you want to feed in terms of computing
1298
15:36:07,180 --> 15:36:15,850
Dirk: goes up, you kind of have to also work, on the other end, to basically keep the pipeline full of things to work on.
1299
15:36:17,670 --> 15:36:25,430
Enrico Fermi Institute: So with that connectivity, or with the connectivity that's in place today, with that model that Fermilab
1300
15:36:25,440 --> 15:36:45,199
Enrico Fermi Institute: used or is using, would that be able to take advantage of all that physical connectivity? The thing I'm kind of struggling with is, how do we go from "ESnet has all this great physical connectivity to clouds" to actually taking advantage of that in a meaningful way?
1301
15:36:45,210 --> 15:37:01,969
Enrico Fermi Institute: You know what I mean, and I know a lot of that kind of falls under your bucket of things that are hazy and need to be investigated more. Is it that, you know, if we were to do this for ALICE, should we mediate all of the data transfer through the Tier 1 and, kind of,
1302
15:37:02,140 --> 15:37:06,130
Enrico Fermi Institute: I guess, orthogonalize the problem, kind of like how Fermilab has it, right, where you have
1303
15:37:06,160 --> 15:37:11,739
Enrico Fermi Institute: connectivity from cloud to national lab as one bit, and then national lab to
1304
15:37:11,780 --> 15:37:29,779
Dale Carder: Right. So you've got that. I mean, that's the class of solutions, right? That's the solution space if you want to work within those confines. If I were, you know, a program officer at DOE or NSF, I would say,
1305
15:37:30,210 --> 15:37:34,860
Dale Carder: Why do you need to do that? What are the other barriers that exist?
1306
15:37:34,930 --> 15:37:38,519
Dale Carder: Tackle those as well? Because some of these are social, political,
1307
15:37:38,680 --> 15:37:54,999
Enrico Fermi Institute: Alright. So it's sort of just, where do you want to go? I mean, of course, our goal is to have something to say in the report, right? And so what recommendation should we make, right, that people go
1308
15:37:55,310 --> 15:38:05,190
Dale Carder: Right. So on that front, one thing that basically came out of this community, if you want to back way up, was the current
1309
15:38:05,300 --> 15:38:23,189
Dale Carder: grant system at NSF, through what's now the CC* program that facilitates campus and regional upgrades, basically manifested from the ESnet Science DMZ model, and then the NSF community buying in that that is
1310
15:38:23,310 --> 15:38:36,369
Dale Carder: an architectural model that they should provide financial support for. If you could extend upon that and say, you know, if you can imagine a world where you could seamlessly take advantage of resources no matter where they lie, what would you need?
1311
15:38:36,610 --> 15:38:42,729
Dale Carder: Couldn't that program evolve, or again facilitate that kind of, you know,
1312
15:38:42,760 --> 15:38:43,990
Dale Carder: connectivity,
1313
15:38:45,710 --> 15:38:50,300
Dale Carder: You know, in the timescale we're talking about, that's not unreasonable.
1314
15:38:54,850 --> 15:38:55,900
Enrico Fermi Institute: Okay,
1315
15:38:57,950 --> 15:39:00,670
Enrico Fermi Institute: were there other questions for Dale.
1316
15:39:08,300 --> 15:39:13,280
Enrico Fermi Institute: Okay? Well, thanks a lot, Dale. I think this is a really interesting discussion.
1317
15:39:13,310 --> 15:39:20,250
Dale Carder: Yeah, and I'll stick around for the rest of the conference, too, if more stuff comes up.
1318
15:39:20,320 --> 15:39:21,780
Enrico Fermi Institute: yeah, that'd be great.
1319
15:39:22,730 --> 15:39:25,880
Enrico Fermi Institute: All right. I will try to go back to the
1320
15:39:26,070 --> 15:39:28,459
Enrico Fermi Institute: sharing the slides over here.
1321
15:39:28,910 --> 15:39:30,250
Enrico Fermi Institute: Um.
1322
15:39:30,800 --> 15:39:36,420
Enrico Fermi Institute: So this kind of leads into the next section, where we wanted to talk a little bit about
1323
15:39:36,490 --> 15:39:38,910
Enrico Fermi Institute: R&D efforts.
1324
15:39:41,730 --> 15:39:44,150
Enrico Fermi Institute: Now we've covered some of this already.
1325
15:39:46,490 --> 15:39:50,170
Enrico Fermi Institute: Um, Dirk, Did you want to say a couple of things about this?
1326
15:39:50,440 --> 15:40:03,390
Dirk: Yeah. This comes directly from a question that's in the charge, where they basically ask us: is there anything we can do on the R&D side, or
1327
15:40:03,670 --> 15:40:05,530
Dirk: that is needed to
1328
15:40:05,900 --> 15:40:09,369
Dirk: what work is needed, to expand
1329
15:40:09,590 --> 15:40:23,570
Dirk: the range of what we can do on commercial cloud and HPC, or increase the cost-effectiveness, which kind of goes hand in hand. And we already talked a little bit about LCF integration, and in the HPC focus area, that there's
1330
15:40:23,640 --> 15:40:27,459
Dirk: work to be done on the Gpu workloads, which is
1331
15:40:27,810 --> 15:40:35,630
Dirk: somewhat out of scope for this workshop, because we're not supposed to talk about framework or software development.
1332
15:40:35,680 --> 15:40:52,100
Dirk: But then there's also integration work. We talked a little bit about this on the cost side: at this point, estimating LCF long-term operations cost is a bit hard because the integration is not fully worked out.
1333
15:40:52,170 --> 15:41:01,009
Dirk: Software delivery: during the HPC focus area everyone kind of agreed that CVMFS is everywhere,
1334
15:41:01,020 --> 15:41:12,510
Dirk: and then there's edge services, where also every HPC seems to do their own thing in what they support. They all want to support it, but they kind of have different solutions in place,
1335
15:41:12,540 --> 15:41:15,390
Dirk: and it's also to me at least a bit unclear
1336
15:41:15,420 --> 15:41:20,420
Dirk: what the long-term operational needs are in this area.
1337
15:41:20,900 --> 15:41:28,610
Dirk: And then we already talked a little bit about dynamic cloud use, which means basically you do your whole
1338
15:41:28,750 --> 15:41:44,449
Dirk: processing chain inside the cloud. Fernando talked about that a little bit, because to reduce egress charges you basically copy your input data in once and then do multiple processing runs on it, and
1339
15:41:44,460 --> 15:41:56,950
Dirk: only keep the end result, basically, and forget about the intermediate output. Then you save: you don't have to get it all out; you only have to get the smaller final output. We already talked about machine learning.
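The egress-saving argument here is simple arithmetic; a sketch with made-up sizes (none of these numbers come from the discussion, and real cloud egress pricing is per-GB and tiered):

```python
# Compare egress volume for two patterns over several processing passes:
# (a) naive: export the full intermediate output of every pass,
# (b) contained: keep intermediates in the cloud and export only the
#     smaller final result.

def egress_tb(passes, intermediate_tb, final_tb, contained):
    if contained:
        return final_tb                # only the end product leaves the cloud
    return passes * intermediate_tb    # every pass's output leaves the cloud

passes = 3
intermediate_tb = 100.0   # hypothetical per-pass output size
final_tb = 5.0            # hypothetical reduced end product

naive_tb = egress_tb(passes, intermediate_tb, final_tb, contained=False)
contained_tb = egress_tb(passes, intermediate_tb, final_tb, contained=True)
# 300.0 TB vs 5.0 TB of billable egress in this invented example
```

The ratio scales with the number of passes, which is why keeping the whole chain inside the cloud pays off most for multi-pass workflows.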
1340
15:41:58,040 --> 15:41:59,560
Dirk: And then uh,
1341
15:42:01,030 --> 15:42:20,909
Dirk: there's R&D work on different architectures, to be able to support those, which opens up possibilities in both HPC and cloud use: FPGAs, various GPU types. That feeds into the GPU workloads, but it's not exclusive to just
1342
15:42:21,080 --> 15:42:32,130
Dirk: GPU workloads, because it could also be machine learning, like, how do we integrate machine learning to make use of these new architectures? And that's going to be
1343
15:42:32,750 --> 15:42:35,820
Dirk: integration R&D, but also basic
1344
15:42:35,910 --> 15:42:41,970
Dirk: R&D on some of these topics. And then there were some
1345
15:42:42,710 --> 15:42:50,129
Dirk: things that we're kind of playing around with that are unique to the cloud, where they offer platforms that
1346
15:42:50,460 --> 15:43:07,240
Dirk: are kind of hard to replicate in-house, like BigQuery, BigTable experiments, functions as a service. I don't know too much about it; we just threw it on here. Maybe Lindsay or Mike could say something about that, or someone else that's more familiar with it.
1347
15:43:10,780 --> 15:43:17,670
Paolo Calafiura (he): I won't say that I'm familiar with functions as a service, but I just want to mention that this is also
1348
15:43:17,690 --> 15:43:30,329
Paolo Calafiura (he): an area important for HPCs. They are developing, probably the same framework, the funcX framework. Yes,
1349
15:43:30,340 --> 15:43:48,699
Paolo Calafiura (he): and there is apparently a way in to the main LCFs for funcX, using something called Parsl. So this is something we are very interested in at CCE, as a possible joint project across the
1350
15:43:48,710 --> 15:44:05,420
Enrico Fermi Institute: So I guess, from personal experience, we actually quite routinely use Parsl for farming out analysis jobs, and at some point back in the day there was a proof of concept
1351
15:44:05,430 --> 15:44:11,389
Enrico Fermi Institute: using a funcX endpoint and doing analysis jobs with that.
1352
15:44:11,420 --> 15:44:36,939
Enrico Fermi Institute: So all of the groundwork for that has actually been laid out, and we could return to using that. We just ended up using Dask a little bit more prevalently. But it's also something that's up to the user, or that we left up to the user at the end of the day, and if we want to develop more infrastructure around that, we have a basis to start from.
1353
15:44:36,950 --> 15:44:53,969
Enrico Fermi Institute: As far as going to production workflows or reconstruction, or something like that, I don't think that's been explored at all, but it looked really promising and interesting from the analysis view of things.
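The farm-out pattern being described — submit many small per-chunk analysis tasks to remote executors and gather the results — can be sketched with Python's standard `concurrent.futures` as a stand-in for a Parsl `@python_app` or a funcX endpoint (the task function, threshold, and data below are invented for illustration, not from the discussion):

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-chunk analysis task. With Parsl this would be a
# @python_app; with funcX, a function registered to a remote endpoint.
def analyze_chunk(events):
    # Stand-in "selection": count entries above an arbitrary threshold.
    return sum(1 for pt in events if pt > 30.0)

def farm_out(chunks, max_workers=4):
    # Submit every chunk and gather results; the scatter/gather shape
    # stays the same regardless of which executor sits underneath.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(analyze_chunk, chunks))

chunks = [[10.0, 35.0, 50.0], [25.0, 31.0], [60.0]]
results = farm_out(chunks)  # [2, 1, 1]
```

Swapping the local pool for a remote executor is exactly the step Parsl and funcX automate, which is why the groundwork mentioned above transfers.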
1354
15:44:53,980 --> 15:45:07,179
Enrico Fermi Institute: And I think at the time it was just a little bit immature compared to where things have gone more recently. For BigQuery and BigTable, I think this is actually,
1355
15:45:07,960 --> 15:45:21,790
Enrico Fermi Institute: right, this was studied by Gordon Watts and company, and they did a couple of benchmarks of what the performance per dollar was for analysis-like queries on
1356
15:45:21,830 --> 15:45:26,670
Enrico Fermi Institute: data sets backed by various engines,
1357
15:45:27,330 --> 15:45:44,309
Enrico Fermi Institute: and we could go and take a look at that paper, but the gist of it was that BigQuery and BigTable are not nearly as cost-efficient as using RDataFrame, for instance, or Coffea, or, well, Awkward Array plus Uproot, for instance.
1358
15:45:44,320 --> 15:46:01,499
Enrico Fermi Institute: So there are already some demonstrations that, while these offerings are there, they're not quite up to the performance that we can already provide with our home-grown tools. But maybe this also provides a way to talk with the bigger cloud services and say, hey,
1359
15:46:01,510 --> 15:46:06,510
Enrico Fermi Institute: this is the kind of performance we need; can we do any performance matching here?
1360
15:46:08,310 --> 15:46:15,509
Dirk: Sorry, that was a bit of an information dump. No, it's fine. But the thing is,
1361
15:46:16,260 --> 15:46:27,480
Dirk: one basic question I had about this is: while some of these areas that are being worked on can provide quite a great
1362
15:46:27,500 --> 15:46:34,020
Dirk: improvement in user experience, like at the analysis level,
1363
15:46:34,090 --> 15:46:40,670
Dirk: to what extent are they applicable if you look at a global picture of
1364
15:46:40,730 --> 15:46:48,980
Dirk: experiment resource use? I mean, the individual user experience doesn't necessarily mean you save a lot of
1365
15:46:48,990 --> 15:47:02,780
Dirk: resources overall, but you can make life easier for your users and you improve the physics output, and that's all great. It's just, in terms of looking at that application of
1366
15:47:02,950 --> 15:47:09,230
Dirk: money: is this a large enough area that we have to,
1367
15:47:10,220 --> 15:47:23,680
Enrico Fermi Institute: how prominently should we put it into the report? Basically, that's what I'm trying to get at.
1368
15:47:23,690 --> 15:47:37,939
Enrico Fermi Institute: As you make things more scalable, so that folks can, you know, do the first exploratory bits of their analysis from their laptop and then scale that seamlessly into the cloud with funcX or whatever does that, we admit,
1369
15:47:38,120 --> 15:47:55,869
Enrico Fermi Institute: if you can make it so that those first exploratory steps are at less scale, then of course that means that the resource usage, as you scale up more and more, is going to be much more uniform between all the users that you
1370
15:47:55,880 --> 15:48:08,630
Enrico Fermi Institute: have engaging with the system, which means you can probably schedule it all a little bit better, which I think is another way of saying you just make things nicer for the users.
1371
15:48:08,640 --> 15:48:28,579
Enrico Fermi Institute: But, one, it means that figuring out how to schedule all that becomes easier, which means it becomes more efficient from your perspective, or from the operational perspective, I would say. And then it also changes the way in which people
1372
15:48:28,590 --> 15:48:51,250
Enrico Fermi Institute: compete for resources at clusters, because all the analyses start looking more and more the same. And they also start reaching the larger resources at a higher level of maturity than perhaps what you see even nowadays; sometimes people just run stuff and see what happens, and it's very, very experimental software, let's say.
1373
15:48:51,260 --> 15:48:54,349
Enrico Fermi Institute: So,
1374
15:48:54,520 --> 15:49:00,139
Enrico Fermi Institute: to answer your question of like, is this big enough to care?
1375
15:49:00,760 --> 15:49:15,249
Enrico Fermi Institute: I have a feeling that right now it is big enough to care about, and the fact that we're getting more data is going to keep it in the regime of being big enough to care about and report on, and to make sure that we actually treat this,
1376
15:49:15,260 --> 15:49:40,909
Enrico Fermi Institute: at least in a special way, because the resource usage pattern is wildly different from production. But as we roll out things like functions as a service, or figure out how to scale columnar analysis and RDataFrame effectively, it's going to shrink the competition, or, yeah, it's going to make the usage of resources easier to manage, which is kind of good for us.
1377
15:49:40,920 --> 15:49:53,019
Enrico Fermi Institute: But also it's not going to become a bigger piece of the competition for all the computing resources. So that's what it sort of looks like in my mind, kind of extrapolating from what we have right now.
1378
15:49:53,070 --> 15:50:12,099
Enrico Fermi Institute: I think the answer, then, is that we need to watch it and see what these systems that are just starting to come online actually do for resource usage, even if it's not at scale, and see if it does bring kind of this evening out of competition for resources at the Tier 2s,
1379
15:50:12,110 --> 15:50:15,289
Enrico Fermi Institute: and otherwise making the
1380
15:50:15,620 --> 15:50:21,180
Enrico Fermi Institute: analysis computing usage a bit more even, as far as,
1381
15:50:21,370 --> 15:50:25,670
Enrico Fermi Institute: sorry, even as far as job submission goes, and things like that.
1382
15:50:25,860 --> 15:50:29,870
Enrico Fermi Institute: That's sort of my view. Of course,
1383
15:50:30,000 --> 15:50:38,340
Enrico Fermi Institute: this is trying to predict the future, so other people, please feel free to predict the future too, and we can see what works.
1384
15:50:39,280 --> 15:50:57,220
Paolo Calafiura (he): Always very informative to hear from you. I'm certainly not nearly as competent, and I know there are more competent people on the call who may want to chime in. But our interest from the CCE side
1385
15:51:05,270 --> 15:51:24,750
Paolo Calafiura (he): is that these workflows are complex enough. And by the way, Dirk, yesterday we heard that CMS is sort of fighting against the provisioning challenges, you know, creating workers with the right,
1386
15:51:24,760 --> 15:51:28,160
Paolo Calafiura (he): the right divided capabilities.
1387
15:51:28,170 --> 15:51:50,549
Paolo Calafiura (he): You know, to some extent, I don't know to which extent, because I'm incompetent there, these issues have been addressed by the folks who developed Parsl. So some of those issues have made the ATLAS side think that Parsl could be a good back end for some of our existing code, in this sort of,
1388
15:51:50,560 --> 15:51:56,159
Paolo Calafiura (he): and I'm hoping that somebody more competent jumps in.
1389
15:51:57,290 --> 15:52:13,480
Enrico Fermi Institute: The only thing that I can tack on to that is that Anna and company back in the day figured out how to make a backfilling system using funcX and Parsl. So that's definitely something that works.
1390
15:52:13,530 --> 15:52:29,769
Enrico Fermi Institute: And that's also what the guys at Nebraska are doing with the Coffea-Casa analysis facility, as they're backfilling into the production jobs. So for sure, this is a pattern that works and that people can implement. But
1391
15:52:29,780 --> 15:52:34,630
Enrico Fermi Institute: we also don't know how it scales out,
1392
15:52:34,750 --> 15:52:43,950
Enrico Fermi Institute: you know, to more and more data and more and more users. The usage right now, I would say, is fairly limited. And yeah, that's,
1393
15:52:45,020 --> 15:52:50,759
Enrico Fermi Institute: I think that helps add some context, but we definitely need to hear from more people on this.
1394
15:52:51,470 --> 15:52:59,310
Dirk: Maybe just one comment: here we're primarily interested in production, but, on the other hand, analysis takes over
1395
15:52:59,610 --> 15:53:06,270
Dirk: half our resources, or half the Tier 2s at least, so there's a significant fraction. So if analysis gets easier,
1396
15:53:06,690 --> 15:53:13,279
Dirk: that means maybe there's more resources for production to use. Just as a quick correction: it's only a quarter, Dirk.
1397
15:53:13,390 --> 15:53:18,340
Dirk: Oh, it's a quarter? I thought it was half the Tier 2s. Now it's a quarter?
1398
15:53:18,530 --> 15:53:20,280
Dirk: It's a quarter now. Okay.
1399
15:53:20,350 --> 15:53:28,460
Enrico Fermi Institute: Yeah, as more production just shows up, the fraction gets smaller and smaller.
1400
15:53:33,200 --> 15:53:46,199
Enrico Fermi Institute: But yeah, just thinking about it more, there's also this rather severe impedance mismatch, at least right now, with the cadence of analysis jobs versus production jobs,
1401
15:53:46,210 --> 15:53:55,879
Enrico Fermi Institute: since it's much more bursty and short-lived, as opposed to a production job that comes in and you know it's going to use twenty-four hours in a slot, or something like that.
1402
15:53:56,180 --> 15:54:02,060
Enrico Fermi Institute: So by its very nature it's a much more adaptive
1403
15:54:02,510 --> 15:54:06,890
Enrico Fermi Institute: and reactive scheduling problem.
1404
15:54:20,280 --> 15:54:28,630
Enrico Fermi Institute: So, one of the things that we mentioned with the cloud offerings, we had a couple of examples: BigQuery, BigTable, functions as a service.
1405
15:54:28,650 --> 15:54:47,950
Enrico Fermi Institute: One of the questions I had, at least, was: is there anything I'm missing, right, like on the cloud? Because if you go and look at the service catalog for something like AWS, it has this humongous spread of services that they offer. Is there anything that we're
1406
15:54:47,990 --> 15:54:49,940
Enrico Fermi Institute: leaving on the table that we should
1407
15:54:50,600 --> 15:54:51,950
Enrico Fermi Institute: you should look into?
1408
15:54:55,200 --> 15:54:59,800
Enrico Fermi Institute: I'll say that something that's interesting,
1409
15:55:00,150 --> 15:55:18,890
Enrico Fermi Institute: maybe not just for clouds but also for sort of on-premises facilities, is things like SONIC, which lets us disaggregate the GPUs and the CPUs. So if you're doing inference, you might not need a whole GPU. But,
1410
15:55:18,900 --> 15:55:27,490
Enrico Fermi Institute: let's say in the cloud case, let's just stick with that, you might be buying a bunch of GPU nodes,
1411
15:55:27,500 --> 15:55:39,980
Enrico Fermi Institute: which are many times more expensive. But if the reconstruction path only needs a quarter of a GPU, being able to independently scale the number of GPUs and CPUs that you're running at a time
1412
15:55:39,990 --> 15:55:51,770
Enrico Fermi Institute: is something useful. And, like I mentioned, for on-premises stuff too, because you can stick either two or four of these GPUs into a box, but if the core count is two hundred and fifty-six on the node, then
1413
15:55:52,010 --> 15:55:54,990
Enrico Fermi Institute: you better hope that the
1414
15:55:55,060 --> 15:56:01,679
Enrico Fermi Institute: fraction of time that you're spending on the GPU, and the speedup that you get, you know, Amdahl's law and all that, actually makes it worthwhile.
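The Amdahl's-law trade-off invoked here can be written down directly; the fractions and accelerator speedups below are illustrative assumptions, not measured numbers from any workload:

```python
def amdahl_speedup(gpu_fraction, gpu_speedup):
    """Overall job speedup when only gpu_fraction of the runtime
    benefits from an accelerator that is gpu_speedup times faster."""
    return 1.0 / ((1.0 - gpu_fraction) + gpu_fraction / gpu_speedup)

# A job that spends 25% of its time in GPU-friendly code gains little,
# capped below 1 / (1 - 0.25) ≈ 1.33x even with an infinitely fast GPU:
modest = amdahl_speedup(0.25, 10.0)   # ≈ 1.29x
# while a 90%-offloadable job gains much more from the same hardware:
heavy = amdahl_speedup(0.90, 10.0)    # ≈ 5.26x
```

This is the sense in which a "quarter of a GPU" per job is the interesting regime: the serial fraction, not the accelerator, dominates, so sharing or disaggregating the GPU wastes little.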
1415
15:56:12,330 --> 15:56:13,160
you.
1416
15:56:19,070 --> 15:56:38,129
Enrico Fermi Institute: Yes, and going on from that, there already is, and there will be, an ever-growing class of analysis users asking for GPUs too, and you have to again deal with this very different rate of scheduling resources for them.
1417
15:56:38,430 --> 15:56:55,730
Enrico Fermi Institute: And sometimes the amount of, or at least the burstiness of, the data processing that they're trying to do on that GPU is much, much higher compared to a production job, even if the total resources are much higher on the production side, just because of job multiplicity.
1418
15:56:55,740 --> 15:57:19,540
Enrico Fermi Institute: You have users that are, you know, just poking around doing their exploratory stuff, and right now we give them a whole T4. Well, a T4 per hour is not cheap, not cheap at all. And you'll have people training models and then loading them onto a T4, running their whole signal data set, or something like that, to see what it looks like in the tails, et cetera, or running it on their backgrounds.
1419
15:57:19,580 --> 15:57:24,290
Enrico Fermi Institute: And it's still the same problem of needing to
1420
15:57:24,450 --> 15:57:42,980
Enrico Fermi Institute: uh, very piecemeal uh schedule your gpus, and then on top of that schedule, all the networking between them, because you have this really insane burst of uh inference requests for a very short amount of time that you need to negotiate on your network to not net or not mess with everyone else's jobs.
1421
15:57:43,170 --> 15:57:44,580
Enrico Fermi Institute: So
1422
15:57:44,620 --> 15:57:54,399
Enrico Fermi Institute: it might not be. It might not be a huge what you said It's a quarter of the tier two right now. It's. Let's say it just stays a quarter of that. But the
1423
15:57:54,590 --> 15:58:09,069
Enrico Fermi Institute: the the way that it's going to be using the resources if it's that bursty may not look like a quarter at certain points in time during the analysis workflow, and that's something we have to be ready to deal with.
1424
15:58:09,370 --> 15:58:13,230
Enrico Fermi Institute: I have no idea how to actually schedule that.
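[Editor's note: the cost side of the "whole T4 per analysis user" problem above can be made concrete with a toy comparison between dedicating a GPU per user and provisioning a shared pool for the aggregate busy time. The hourly price and duty cycle here are made-up placeholders, not quoted rates.]

```python
# Toy cost comparison: dedicated GPU per bursty analysis user vs. a shared
# pool sized for aggregate busy time. All numbers are illustrative placeholders.

T4_PRICE_PER_HOUR = 0.35   # assumed on-demand hourly price, illustrative only

def dedicated_cost(users: int, hours: float) -> float:
    """Every user holds a whole GPU for the full session."""
    return users * hours * T4_PRICE_PER_HOUR

def shared_cost(users: int, hours: float, duty_cycle: float) -> float:
    """Users burst onto a shared pool; GPU hours are provisioned for the
    aggregate busy time rather than one GPU per user."""
    return users * hours * duty_cycle * T4_PRICE_PER_HOUR

# 20 users poking around for 8 hours, each actually busy ~5% of the time:
print(round(dedicated_cost(20, 8), 2))      # 56.0
print(round(shared_cost(20, 8, 0.05), 2))   # 2.8
```

The gap between the two numbers is the scheduling prize; the hard part, as noted above, is that the shared pool must absorb everyone's bursts without letting one user's inference spike starve the others.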
1425
15:58:13,490 --> 15:58:14,539
Mhm
1426
15:58:19,200 --> 15:58:23,320
Enrico Fermi Institute: So, we're almost at the top of the hour.
1427
15:58:23,800 --> 15:58:28,420
Enrico Fermi Institute: Any other topics that we wanted to hit before we wrap up for the day?
1428
15:58:41,590 --> 15:58:47,809
Enrico Fermi Institute: So I think, logistically, tomorrow we were going to talk a little bit about...
1429
15:58:49,090 --> 15:58:54,949
Enrico Fermi Institute: Let's see. In the morning, I think we were going to talk about accounting and pledging.
1430
15:58:55,240 --> 15:58:57,530
Enrico Fermi Institute: We're going to talk about some, you know,
1431
15:58:57,840 --> 15:59:14,780
Enrico Fermi Institute: facility features and policies; have a discussion about security topics when it comes to HPC and cloud; and allocations, you know, planning, that sort of thing. Then, in the afternoon,
1432
15:59:14,790 --> 15:59:18,350
Enrico Fermi Institute: we'll have a presentation from the
1433
15:59:18,520 --> 15:59:22,869
Enrico Fermi Institute: from the Vera Rubin folks to talk about their experiences.
1434
15:59:23,700 --> 15:59:42,449
Enrico Fermi Institute: And then, yeah, some summary-type work, and just, you know, other topics or observations that people would like to bring up. So if there's something that we haven't hit on the agenda that people would really like to talk about, tomorrow afternoon would be a really good time to bring that up.
1435
15:59:47,150 --> 15:59:49,349
Enrico Fermi Institute: Anything else from anyone?
1436
15:59:55,150 --> 16:00:00,209
Enrico Fermi Institute: Okay, sounds like not. All right, thanks, everybody. We'll talk to you tomorrow.
1437
16:00:01,790 --> 16:00:03,559
Fernando Harald Barreiro Megino: Bye. Thank you.