[Eastern Time]
CERN presentation
615
14:00:29,710 --> 14:00:34,679
Enrico Fermi Institute: I think this is the last session that's focused exclusively on cloud.
616
14:00:36,900 --> 14:00:37,920
Yeah.
617
14:00:38,670 --> 14:00:44,219
Enrico Fermi Institute: In the next session we'll talk about some R&D things and networking. So
618
14:00:52,660 --> 14:00:57,720
Enrico Fermi Institute: okay, so maybe we break here, and we'll see everybody at one o'clock Central Time,
619
14:00:58,540 --> 14:01:00,130
Fernando Harald Barreiro Megino: so you know
620
14:01:01,310 --> 14:01:02,620
Enrico Fermi Institute: machine learning,
621
14:01:03,820 --> 14:01:09,699
Enrico Fermi Institute: and then we'll go back to the topics as presented in the slides,
622
14:01:10,610 --> 14:01:12,850
Enrico Fermi Institute: so we'll just get started in a few minutes here,
623
14:01:53,380 --> 14:01:56,370
Maria Girone: So it's Eric starting first, right?
624
14:01:56,540 --> 14:02:04,520
Maria Girone: Yeah, if Eric is ready to present, we thought maybe it would be best to just have him.
625
14:02:07,990 --> 14:02:16,680
Enrico Fermi Institute: It's getting a little bit late, I'm concerned. Yeah, exactly. We want to be considerate of people's time, in Europe especially. Thank you.
626
14:02:29,590 --> 14:02:39,579
Enrico Fermi Institute: So just give it like two more minutes, and then, Eric, whenever you're ready, you know, put your slides up. I'll stop sharing here when we get started shortly.
627
14:02:42,740 --> 14:02:47,450
Eric Wulff: Sounds good. I'm ready whenever, so just let me know. Okay,
630
14:03:09,350 --> 14:03:17,650
Enrico Fermi Institute: It seems like the rate at which people are rejoining has slowed down significantly, so I think you can go ahead and start.
631
14:03:22,080 --> 14:03:23,529
Eric Wulff: uh, Okay.
632
14:03:24,610 --> 14:03:25,870
Eric Wulff: So
633
14:03:27,290 --> 14:03:31,050
Eric Wulff: I'm sharing now, I think. Can you see?
634
14:03:31,340 --> 14:03:33,999
Eric Wulff: Yes, it looks good. Okay, great.
635
14:03:34,560 --> 14:03:37,929
Eric Wulff: Um, so I just have
636
14:03:38,180 --> 14:03:52,689
Eric Wulff: two or three slides here, so it's a very short presentation, just to talk a little bit about what we have been doing regarding distributed training and hyperparameter tuning of deep-learning-based algorithms using, you know, HPC computing.
637
14:03:53,360 --> 14:04:00,499
Eric Wulff: So this is something that I have been doing in the context of an EU-funded research project called CoE RAISE.
638
14:04:06,260 --> 14:04:08,620
Eric Wulff: involved in this, and she's my supervisor.
639
14:04:09,580 --> 14:04:10,969
Um.
640
14:04:12,850 --> 14:04:15,450
So let's see if I can change slide.
641
14:04:15,770 --> 14:04:17,940
Eric Wulff: Yes, um.
642
14:04:18,590 --> 14:04:24,429
Eric Wulff: So, just in case you're not aware of hyperparameter optimization:
643
14:04:25,320 --> 14:04:35,079
Eric Wulff: so if you're not aware of what that is, I've tried to summarize it very quickly here in just one slide. I will sometimes refer to it as hypertuning,
644
14:04:35,140 --> 14:04:36,670
Eric Wulff: and um,
645
14:04:36,730 --> 14:04:39,300
Eric Wulff: it's basically to um
646
14:04:39,340 --> 14:04:49,350
Eric Wulff: to tune the hyperparameters of an AI model or a deep learning model, and hyperparameters are simply the settings of the model.
647
14:04:58,840 --> 14:05:09,139
Eric Wulff: And they can define things like the model architecture. So, for instance, how many layers you have in your neural network, how many nodes you have in each layer, and so on.
648
14:05:09,520 --> 14:05:19,239
Eric Wulff: But they also define things that have to do with the optimization of the model, such as the learning rate, the batch size, and so forth.
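[Editor's illustration. The kind of search space being described might look like the following sketch; the names and ranges here are invented for the example, not taken from the MLPF setup.]

```python
import random

# Illustrative hyperparameter search space: architecture choices plus
# optimization settings. Names and ranges are invented for this sketch.
SEARCH_SPACE = {
    "num_layers":      lambda: random.randint(2, 8),           # architecture
    "nodes_per_layer": lambda: random.choice([64, 128, 256]),  # architecture
    "learning_rate":   lambda: 10 ** random.uniform(-5, -2),   # optimization (log-uniform)
    "batch_size":      lambda: random.choice([32, 64, 128, 256]),
}

def sample_config():
    """Draw one hyperparameter configuration, i.e. one 'trial' to train."""
    return {name: draw() for name, draw in SEARCH_SPACE.items()}

print(sample_config())
```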
649
14:05:19,720 --> 14:05:20,570
Yeah.
650
14:05:22,180 --> 14:05:28,950
Eric Wulff: Now, if you have a large model, or a very complex model, which requires a lot of compute
651
14:05:29,220 --> 14:05:30,469
Eric Wulff: and
652
14:05:31,480 --> 14:05:33,510
Eric Wulff: uh, to do the forward pass,
653
14:05:33,610 --> 14:05:34,950
Eric Wulff: and
654
14:05:35,630 --> 14:05:38,329
Eric Wulff: and/or you have a large data set,
655
14:05:38,360 --> 14:05:41,660
Eric Wulff: hypertuning can be extremely
656
14:05:41,940 --> 14:05:56,630
Eric Wulff: compute-resource intensive. Therefore it can benefit greatly from HPC resources. And furthermore, we need smart and efficient search algorithms to find good hyperparameters, so that we don't waste the HPC resources that we have.
657
14:05:59,290 --> 14:06:00,480
Eric Wulff: um.
658
14:06:01,000 --> 14:06:10,500
Eric Wulff: So in RAISE, I have been working with a group working on machine-learned particle flow, which is
659
14:06:10,810 --> 14:06:13,939
Eric Wulff: a collaboration with CMS,
660
14:06:14,080 --> 14:06:17,230
Eric Wulff: with people from CMS. And
661
14:06:17,420 --> 14:06:19,599
Eric Wulff: in order to hypertune this model,
662
14:06:19,690 --> 14:06:25,310
Eric Wulff: in RAISE we have been using an open-source framework called Ray Tune,
663
14:06:25,750 --> 14:06:34,059
Eric Wulff: which allows us to run many different trials in parallel, using multiple GPUs per trial,
664
14:06:34,270 --> 14:06:39,010
Eric Wulff: which is what this picture up here is trying to represent.
665
14:06:39,570 --> 14:06:40,990
Eric Wulff: And
666
14:06:42,990 --> 14:06:51,389
Eric Wulff: Now, with Ray Tune we can also get a very nice overview of the different trials, and we can pick the one that we see performs the best.
667
14:06:51,580 --> 14:06:57,289
Eric Wulff: And Ray Tune also has a lot of different search algorithms that
668
14:06:57,660 --> 14:07:01,359
Eric Wulff: help us find the right
669
14:07:01,690 --> 14:07:02,970
Eric Wulff: I, the parameters.
670
14:07:03,430 --> 14:07:18,949
Eric Wulff: And here, to the right, we have an example of the kind of difference this can make to the learning of the model. So here we have plotted the training and validation losses before and after hypertuning.
671
14:07:20,620 --> 14:07:32,120
Eric Wulff: So, as you can see here, the loss went down quite a bit after hypertuning, almost by a factor of two, and furthermore, the training seems to be much more stable. We have
672
14:07:32,380 --> 14:07:36,559
Eric Wulff: these bands, which represent the standard deviation
673
14:07:36,750 --> 14:07:42,170
Eric Wulff: between different trainings. It's much more stable in the right plot.
674
14:07:47,030 --> 14:07:56,090
Eric Wulff: And I just had one more slide here to sort of illustrate how useful high-performance computing can be in order to speed up
675
14:07:56,810 --> 14:07:58,380
hyperparameter optimization.
676
14:07:58,560 --> 14:08:03,430
Eric Wulff: So this just shows the scaling from four to twenty-four
677
14:08:03,680 --> 14:08:05,309
Eric Wulff: compute nodes.
678
14:08:05,330 --> 14:08:06,550
Eric Wulff: Um,
679
14:08:06,990 --> 14:08:15,439
Eric Wulff: Maybe particularly looking at the plot to the right here, we can see that the scaling for this use case is actually better than linear,
680
14:08:15,570 --> 14:08:20,269
Eric Wulff: which at least in part has to do with
681
14:08:20,820 --> 14:08:26,109
Eric Wulff: some excessive reloading of models that happens when we have few nodes.
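[Editor's illustration. The superlinear scaling can be made concrete with a toy cost model; all numbers below are invented. With a fixed pool of trials, fewer nodes means each node time-shares more trials, and every swap costs a model reload, so per-epoch overhead grows as the node count shrinks.]

```python
import math

TRIALS = 48          # concurrent trials the search wants to evaluate (invented)
EPOCH_TIME = 60.0    # minutes of pure training per trial per epoch (invented)
RELOAD_TIME = 5.0    # minutes to checkpoint and reload a swapped trial (invented)

def time_per_epoch(nodes):
    """Minutes to advance all trials by one epoch on `nodes` nodes."""
    trials_per_node = math.ceil(TRIALS / nodes)
    # each extra trial sharing a node forces one reload per epoch
    reloads = trials_per_node - 1
    return trials_per_node * EPOCH_TIME + reloads * RELOAD_TIME

speedup = time_per_epoch(4) / time_per_epoch(24)
print(speedup, 24 / 4)  # the measured speedup exceeds the 6x node ratio
```

Under these assumptions, going from 4 to 24 nodes gives a speedup slightly above 6x, because the reload overhead shrinks along with the time-sharing, which is the mechanism described above.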
682
14:08:28,060 --> 14:08:29,150
Eric Wulff: Um.
683
14:08:31,070 --> 14:08:35,830
Eric Wulff: So, well, this basically means that the more
684
14:08:36,030 --> 14:08:41,099
Eric Wulff: nodes we have, the more GPUs we have, the faster we can tune and optimize these models.
685
14:08:41,670 --> 14:08:47,480
Eric Wulff: That's all I had for for this.
686
14:08:48,740 --> 14:08:58,029
Enrico Fermi Institute: Can you tell a priori, from the model, that the model you're using will
687
14:08:58,080 --> 14:09:04,340
Enrico Fermi Institute: show this sort of behavior, so that if someone comes with any given model, you know how to sort of shape the work,
688
14:09:06,550 --> 14:09:15,609
Enrico Fermi Institute: if you understand what I mean. No? What I mean is, you discovered that you get better than linear scaling with this training.
689
14:09:15,700 --> 14:09:16,719
Right?
690
14:09:17,160 --> 14:09:22,499
Enrico Fermi Institute: Is that always the case, or is that the case with any given model?
691
14:09:23,150 --> 14:09:24,459
Um,
692
14:09:25,150 --> 14:09:33,199
Eric Wulff: Yeah, I think, so, this is showing the scaling of the hyperparameter optimization itself.
693
14:09:33,650 --> 14:09:40,180
Eric Wulff: So it's not, if you had just a single training, it wouldn't scale like this; it would be
694
14:09:40,360 --> 14:09:42,610
Eric Wulff: a bit worse than linear, probably.
695
14:09:45,610 --> 14:09:51,289
Eric Wulff: So the way that the hypertuning works in this case is that we
696
14:09:51,430 --> 14:09:53,199
Eric Wulff: launch a bunch of
697
14:09:53,690 --> 14:09:56,980
Eric Wulff: trials in parallel with different hyperparameter
698
14:09:57,010 --> 14:09:58,559
Eric Wulff: configurations.
699
14:09:58,990 --> 14:10:00,189
Eric Wulff: And then
700
14:10:00,340 --> 14:10:01,780
Eric Wulff: um!
701
14:10:02,230 --> 14:10:10,820
Eric Wulff: There is a sort of scheduling or search algorithm looking at how well all these trials perform,
702
14:10:10,940 --> 14:10:22,829
Eric Wulff: and then it terminates the ones that look less promising and continues training the ones that look promising. And then we can also have some kind of Bayesian optimization
703
14:10:23,190 --> 14:10:26,360
Eric Wulff: component here, which tries to predict which
704
14:10:27,470 --> 14:10:31,230
Eric Wulff: hyperparameters would perform well, and then we try those next.
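[Editor's illustration. The pause-and-terminate scheduling described here is in the spirit of successive halving, which Ray Tune's ASHA-style schedulers automate. The following deliberately simplified pure-Python sketch, not the Ray Tune API, trains all trials to a common synchronization epoch, keeps the more promising half, and backfills with fresh configurations; `fake_train` is an invented stand-in for a real training run.]

```python
import random

def fake_train(config, epochs):
    """Stand-in for a real training run: returns a loss that improves
    with epochs and depends on an invented learning-rate setting."""
    lr = config["learning_rate"]
    return abs(lr - 3e-4) * 100 + 1.0 / (1 + epochs)

def successive_halving(num_trials=8, sync_epochs=5, rounds=3):
    trials = [{"learning_rate": 10 ** random.uniform(-5, -2)}
              for _ in range(num_trials)]
    for _ in range(rounds):
        # train every surviving trial up to the same epoch number...
        scored = sorted(trials, key=lambda c: fake_train(c, sync_epochs))
        # ...terminate the less promising half...
        survivors = scored[: max(1, len(scored) // 2)]
        # ...and start fresh configurations in place of the terminated ones
        fresh = [{"learning_rate": 10 ** random.uniform(-5, -2)}
                 for _ in range(len(scored) - len(survivors))]
        trials = survivors + fresh
    return min(trials, key=lambda c: fake_train(c, sync_epochs))

best = successive_halving()
print(best["learning_rate"])
```

A Bayesian search component, as mentioned in the talk, would replace the random backfill with configurations proposed from a model of past trial results.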
705
14:10:32,930 --> 14:10:39,059
Enrico Fermi Institute: And if you were to double or triple the number of nodes, would it continue, or
706
14:10:39,310 --> 14:10:42,929
Enrico Fermi Institute: does the actual growth begin to flatten out?
707
14:10:43,430 --> 14:11:00,910
Eric Wulff: Um, I haven't tested this with more than up to twenty-four nodes, so I can't say for sure, but I imagine it will continue for at least a bit more. But I can't say for how long, and
708
14:11:01,060 --> 14:11:16,039
Enrico Fermi Institute: I would also assume that eventually it would flatten off.
709
14:11:17,080 --> 14:11:18,540
Eric Wulff: Um,
710
14:11:19,510 --> 14:11:23,909
Enrico Fermi Institute: Yeah. Is the issue resource contention?
711
14:11:24,600 --> 14:11:30,520
Eric Wulff: Yeah, it has to do with the search algorithm, which
712
14:11:30,630 --> 14:11:32,309
Eric Wulff: um
713
14:11:33,180 --> 14:11:39,990
Eric Wulff: trains a few trials and then terminates bad ones, and then continues with new ones. So
714
14:11:40,360 --> 14:11:48,789
Eric Wulff: if you have more trials than you have nodes that you want to run, you have to sort of
715
14:11:49,280 --> 14:11:54,179
Eric Wulff: pause trials at some point, and start training other ones.
716
14:11:54,590 --> 14:11:56,110
Eric Wulff: Um!
717
14:11:56,270 --> 14:12:02,699
Eric Wulff: Because you need to train all the trials up to the same epoch number before you decide which ones to keep and which not.
718
14:12:04,140 --> 14:12:11,450
Eric Wulff: So it doesn't have to do with Ray Tune per se. It just has to do with the particular search algorithm, or,
719
14:12:11,530 --> 14:12:15,219
Eric Wulff: a lot of search algorithms actually work like that.
720
14:12:18,070 --> 14:12:19,019
Yeah,
721
14:12:19,250 --> 14:12:21,929
Enrico Fermi Institute: You have a question or comment in the chat.
722
14:12:22,100 --> 14:12:40,870
Ian Fisk: Yeah, I had a question for Eric, and maybe it's too early to tell, but my question was, how stable do you expect the hyperparameter tuning to be, in the sense that: are we expecting that every time we change the network or get new data, we're going to have to re-optimize the hyperparameters? Or is this something that,
723
14:12:40,880 --> 14:12:50,119
Ian Fisk: once we sort of optimize for a particular problem, we may find that those are stable over periods of time? The reason I ask is that this seems like,
724
14:12:50,620 --> 14:12:59,900
Ian Fisk: when we talk about the use of HPC or clouds and specialized resources, training is a big part of how we tend to use them. But the hyperparameter
725
14:13:00,190 --> 14:13:11,330
Ian Fisk: optimization sort of increases that by a factor of fifty or so. And so, if we have to do it each time, we probably need to factor those things into our thoughts about where we're constrained on resources.
726
14:13:12,110 --> 14:13:14,099
Eric Wulff: Yeah, so
727
14:13:14,770 --> 14:13:16,039
Eric Wulff: um,
728
14:13:16,760 --> 14:13:23,389
Eric Wulff: it would completely depend on how much you change your model, or how much you change the problem.
729
14:13:23,470 --> 14:13:24,989
Eric Wulff: I mean, if you're
730
14:13:25,010 --> 14:13:27,139
Eric Wulff: if you change your model
731
14:13:27,180 --> 14:13:32,739
Eric Wulff: architecture, you will probably have to run a new hyperparameter optimization,
732
14:13:32,770 --> 14:13:38,310
Eric Wulff: because you might not even have the same hyperparameters in your model anymore.
733
14:13:38,550 --> 14:13:40,150
Eric Wulff: Uh,
734
14:13:40,610 --> 14:13:56,560
Eric Wulff: But, you know, if things aren't too different, you might not have to hypertune, or maybe just do a small hypertuning, you know, just a few parameters in some narrow or small search space.
735
14:13:56,690 --> 14:13:58,640
Eric Wulff: So, for instance,
736
14:13:59,020 --> 14:14:00,809
Eric Wulff: if you look at other
737
14:14:00,840 --> 14:14:01,950
Eric Wulff: uh
738
14:14:02,920 --> 14:14:06,280
Eric Wulff: other fields, such as, for instance,
739
14:14:06,390 --> 14:14:09,070
Eric Wulff: image recognition or object detection.
740
14:14:09,210 --> 14:14:26,879
Eric Wulff: If you find a network that performs well on, you know, classifying certain kinds of objects, then it's very likely that, using the same hyperparameters, it would be good at classifying other kinds of objects as well, if you just have labeled data for those objects.
741
14:14:26,890 --> 14:14:29,329
So in that case, probably you wouldn't have to
742
14:14:31,100 --> 14:14:33,510
Eric Wulff: run a full hyperparameter optimization again.
743
14:14:37,260 --> 14:14:46,599
Ian Fisk: Thanks. It's very impressive, the amount that it improves the situation by doing this separately. Getting a factor of two is nice.
744
14:14:48,460 --> 14:14:49,360
Eric Wulff: Thanks.
745
14:14:50,810 --> 14:15:07,050
Paolo Calafiura (he): A question or comment from Paolo. Yes, I missed the first couple of minutes, sorry, so I don't know if the question was addressed there. My question is: here you're starting to show the scaling at four nodes,
746
14:15:07,060 --> 14:15:13,339
Paolo Calafiura (he): and I wonder what the scaling would look like if you compared it with a single node, or a single GPU.
747
14:15:14,870 --> 14:15:16,540
Eric Wulff: Um.
748
14:15:26,890 --> 14:15:32,669
Eric Wulff: The fewer nodes you have, the more of this excessive reloading has to happen.
749
14:15:32,930 --> 14:15:37,320
Eric Wulff: So just using one node would be very, very slow.
750
14:15:37,510 --> 14:15:50,440
Paolo Calafiura (he): But is that because of the way Ray Tune does this business, or because of the search algorithm we use? So it's not Ray Tune per se, it's the,
751
14:15:51,360 --> 14:15:58,859
Eric Wulff: it's because of the algorithm. You wouldn't be able to run this faster with another framework. Well, I mean,
752
14:15:59,760 --> 14:16:18,139
Paolo Calafiura (he): it's the algorithm's problem, not Ray Tune's. So it's a little bit harder then to do the comparison. I mean, I'm thinking, if you used something like scikit-optimize on a single GPU to do the same thing. And then, of course, there is the question: what is the,
753
14:16:22,910 --> 14:16:26,699
Paolo Calafiura (he): Okay, it's a complicated question.
754
14:16:29,870 --> 14:16:32,029
Okay? Next we have
755
14:16:34,400 --> 14:16:45,700
Shigeki: Uh, yeah, I'm gonna show my ignorance here, just trying to understand exactly how this works. I think I'm on the first slide, second slide.
756
14:16:45,730 --> 14:16:54,140
Shigeki: You show trial one, trial two, trial three, and those trials are independent of each other, right? They're all working on,
757
14:16:54,440 --> 14:17:12,849
Shigeki: Okay. The next thing here is that presumably they're reading the same set of data over, in order to train, but they're completely independent in terms of where they are in the input stream, right? They're not, like, working in lockstep or anything.
758
14:17:13,630 --> 14:17:25,690
Eric Wulff: It depends on the kind of search algorithm that you use, the hyperparameter search algorithm. So, um,
759
14:17:26,590 --> 14:17:27,650
Eric Wulff: in um.
760
14:17:28,350 --> 14:17:40,270
Eric Wulff: Well, to begin with, you can choose not to use any particular search algorithm, and then everything is just done in parallel, sort of.
761
14:17:40,560 --> 14:17:41,710
Eric Wulff: however,
762
14:17:42,000 --> 14:17:53,250
Eric Wulff: it's much more efficient to use some kind of search algorithm. So then you would want to train all the trials up to a certain
763
14:17:53,570 --> 14:17:58,200
Eric Wulff: epoch number. Let's say you train them all up to epoch five, and then you look at,
764
14:17:58,530 --> 14:18:08,800
Eric Wulff: you have some algorithm that decides which ones to terminate and which ones to continue training, and in place of the ones you terminated, you start new trials
765
14:18:08,820 --> 14:18:12,450
Eric Wulff: with new hyperparameter configurations.
766
14:18:12,500 --> 14:18:19,529
Eric Wulff: So then, if you have many more trials than you have compute nodes, you have to
767
14:18:19,720 --> 14:18:27,839
Eric Wulff: pause some trials at epoch five, and then load in new trials and train them up until epoch five.
768
14:18:28,230 --> 14:18:30,749
Shigeki: Okay. So
769
14:18:31,070 --> 14:18:35,280
Shigeki: Okay. But to a certain extent, though, the trials are running independently,
770
14:18:35,290 --> 14:18:51,889
Shigeki: and they get synchronized at some point by the epoch that you're stopping at. But other than that, up to that epoch point, they're blasting through the data as quickly as they can. And so they're not in sync. Okay,
771
14:18:52,640 --> 14:18:53,690
Shigeki: thank you.
772
14:18:56,430 --> 14:18:59,330
Enrico Fermi Institute: So how long does it take to run this on,
773
14:18:59,370 --> 14:19:07,800
Enrico Fermi Institute: you know, for one node? How long is it running the hyperparameter optimization, in terms of wall-time hours?
774
14:19:08,120 --> 14:19:09,599
Eric Wulff: Um!
775
14:19:10,010 --> 14:19:11,059
Eric Wulff: So
776
14:19:11,130 --> 14:19:21,010
Eric Wulff: That can vary a lot, depending on how large your search space is, and the model we use, and the data that we use, and so on. I think for the results I show here,
777
14:19:21,310 --> 14:19:22,860
Eric Wulff: Um
778
14:19:23,820 --> 14:19:26,859
Eric Wulff: uh, If I remember correctly,
779
14:19:27,120 --> 14:19:33,029
Eric Wulff: the whole thing took around eighty hours
780
14:19:33,190 --> 14:19:35,740
Eric Wulff: in wall time,
781
14:19:35,980 --> 14:19:40,909
Eric Wulff: and that was using twelve
782
14:19:40,930 --> 14:19:45,800
Eric Wulff: compute nodes with four GPUs each.
783
14:19:45,810 --> 14:20:11,110
Enrico Fermi Institute: That can be, you know, trivially broken up into multiple jobs and things like that? The reason I ask is, one of the things I notice is that on some of the HPCs, at least in the US, they have maximum wall times for your jobs in the queues, right? So I'm looking at Perlmutter right now, and it says you can have a GPU job in the regular queue for twelve hours at most.
784
14:20:11,120 --> 14:20:15,659
Enrico Fermi Institute: And so I'm wondering, like, what useful work can we get done, for
785
14:20:15,870 --> 14:20:25,280
Enrico Fermi Institute: you know, hyperparameter optimization or machine learning in general, given the relatively short maximum wall time?
786
14:20:25,450 --> 14:20:29,280
Eric Wulff: Um. So one solution is to uh
787
14:20:29,460 --> 14:20:31,290
Eric Wulff: checkpoint
788
14:20:31,950 --> 14:20:39,149
Eric Wulff: the search, and then just launch it again and continue where you left off. We're able to do that. So
789
14:20:39,190 --> 14:20:44,300
Eric Wulff: we are saving checkpoints regularly throughout the workload.
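[Editor's illustration. Ray Tune supports resuming an interrupted experiment from checkpoints; as a minimal pure-Python sketch of the pattern, with the filename and state layout invented for the example, a walltime-limited job can save its state once per epoch and a follow-up job resumes where the last one stopped.]

```python
import json
import os

CKPT = "hpo_checkpoint.json"  # invented filename for this sketch

def load_state():
    """Resume from the last checkpoint if one exists, else start fresh."""
    if os.path.exists(CKPT):
        with open(CKPT) as f:
            return json.load(f)
    return {"epoch": 0, "best_loss": float("inf")}

def save_state(state):
    # write-then-rename so a killed job cannot leave a torn checkpoint
    tmp = CKPT + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, CKPT)

def run(max_epochs, walltime_epochs):
    """Train until the walltime budget runs out, checkpointing each epoch."""
    state = load_state()
    for _ in range(walltime_epochs):
        if state["epoch"] >= max_epochs:
            break
        state["epoch"] += 1
        state["best_loss"] = min(state["best_loss"], 1.0 / state["epoch"])
        save_state(state)  # once per epoch, as described in the talk
    return state

first = run(max_epochs=100, walltime_epochs=12)   # job 1 hits the walltime limit
second = run(max_epochs=100, walltime_epochs=12)  # job 2 resumes from the checkpoint
print(first["epoch"], second["epoch"])
```

Chaining such jobs is what makes an eighty-hour search fit inside a queue with a twelve-hour wall-time limit.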
790
14:20:45,570 --> 14:20:47,679
Eric Wulff: Okay? And uh, yeah,
791
14:20:47,820 --> 14:20:50,360
Enrico Fermi Institute: how often do you save the checkpoints?
792
14:20:51,280 --> 14:21:07,169
Eric Wulff: That's configurable, but usually once per epoch, so once per read-through of the data set.
793
14:21:08,020 --> 14:21:15,920
Eric Wulff: That depends a lot also, but let's say, well, between twelve and twenty-four hours.
794
14:21:17,110 --> 14:21:20,540
Eric Wulff: But this completely depends on how much data you have. And uh,
795
14:21:21,140 --> 14:21:24,060
Eric Wulff: you know, the particular model you use.
796
14:21:24,530 --> 14:21:41,880
Enrico Fermi Institute: That's an epoch for the hyperparameter optimization itself, not just a single instance of the neural network?
797
14:21:42,740 --> 14:21:45,710
twenty-four hours for a single,
798
14:21:46,740 --> 14:21:53,449
Eric Wulff: and that's um. So that you know we have quite a big data set. So that's
799
14:21:53,510 --> 14:22:00,430
Eric Wulff: why. But we're also using four A100 GPUs for that. So
800
14:22:00,820 --> 14:22:02,320
Eric Wulff: if you have a
801
14:22:02,640 --> 14:22:05,420
Eric Wulff: older GPUs, that would take much longer.
802
14:22:08,980 --> 14:22:19,460
Enrico Fermi Institute: I guess what I'm wondering is, you know, for the report, should we have some recommendation about the policies at these sites, you know, to
803
14:22:20,140 --> 14:22:25,540
Enrico Fermi Institute: allow much longer GPU jobs to run, to do these sorts of tasks?
804
14:22:26,090 --> 14:22:29,069
Eric Wulff: Well, my opinion is that it would be
805
14:22:29,720 --> 14:22:47,669
Enrico Fermi Institute: it would be convenient if we could. But you know, it's not deal-breaking, because we can checkpoint this and just reload, right? But, you just said your epochs are twelve to twenty-four hours, and Lincoln just said that
806
14:22:47,720 --> 14:22:57,990
Eric Wulff: twelve hours. So, sorry, sorry, I misspoke here, so
807
14:22:58,500 --> 14:23:13,459
Eric Wulff: apologies, it's a bit late over here. It takes twenty-four hours for a full training, not for one epoch.
808
14:23:13,470 --> 14:23:33,439
Enrico Fermi Institute: We're not asking for a policy change, right? Just a behavioral change with checkpointing. And you're saving at the end of each full training, or each epoch? Yeah, sorry. You have, like, two hundred epochs, is that right? It's probably in the plot.
809
14:23:33,650 --> 14:23:37,789
Eric Wulff: Uh: yeah, yeah, in the plot here. So Um:
810
14:23:38,030 --> 14:23:56,069
Eric Wulff: Yeah. So this is a plot from last year. Now we have a larger data set, and we train for about a hundred epochs, and that takes roughly twenty-four hours.
811
14:23:57,900 --> 14:23:59,820
Enrico Fermi Institute: Okay, Um,
812
14:24:00,170 --> 14:24:13,310
Enrico Fermi Institute: Yeah, would adding more GPUs per node help you in terms of number of epochs? Or do you have enough data to get reasonable convergence, at least with this model, after one hundred?
813
14:24:21,110 --> 14:24:22,430
Eric Wulff: actually we are.
814
14:24:22,690 --> 14:24:27,659
Eric Wulff: We just saw that if we scale up our model
815
14:24:27,690 --> 14:24:40,729
Eric Wulff: significantly, so make the model larger, with many more parameters, we can easily improve the physics performance. We just tried that
816
14:24:41,300 --> 14:24:44,330
Eric Wulff: this week,
817
14:24:44,660 --> 14:24:47,859
Eric Wulff: because we were curious, basically. However,
818
14:24:47,920 --> 14:24:49,790
Eric Wulff: that's sort of not a
819
14:24:58,390 --> 14:25:02,050
Eric Wulff: quickly enough in production, anyway.
820
14:25:02,590 --> 14:25:03,639
Eric Wulff: Um,
821
14:25:06,150 --> 14:25:08,350
Eric Wulff: but it sort of shows that the
822
14:25:08,440 --> 14:25:17,159
Eric Wulff: there is enough information in the data to do better. We just need to improve the model, or the training of the model, somehow.
823
14:25:20,160 --> 14:25:25,100
Enrico Fermi Institute: Okay. Shigeki, you have your hand raised.
824
14:25:25,830 --> 14:25:42,530
Shigeki: Uh, yeah, I just have a question in terms of the amount of data you're going through and the model size, which I guess is measured in terms of number of parameters as well as hyperparameters, and whether or not there is a
825
14:25:42,540 --> 14:25:54,120
Shigeki: size that physics problems in HEP tend to gravitate to, or can it be all over the map in terms of model size, data set size, and number of hyperparameters?
826
14:25:55,040 --> 14:25:56,179
Eric Wulff: Um!
827
14:25:56,320 --> 14:26:00,129
Eric Wulff: So, the number of hyperparameters,
828
14:26:00,190 --> 14:26:07,620
Eric Wulff: that's a little bit arbitrary, depending on what you mean by hyperparameters. So if you
829
14:26:08,040 --> 14:26:10,180
Eric Wulff: uh if you count
830
14:26:10,250 --> 14:26:11,389
Eric Wulff: well,
831
14:26:11,430 --> 14:26:13,889
Eric Wulff: you you you can configure
832
14:26:14,040 --> 14:26:23,330
Eric Wulff: very many things with our model. So if you count all those hyperparameters, I don't know how many they are, but there are hundreds, and we don't tune all of them, because they're too many.
833
14:26:28,100 --> 14:26:33,720
Eric Wulff: The number of trainable parameters in the model is around one million,
834
14:26:34,130 --> 14:26:37,850
Eric Wulff: so that's fairly small, if you
835
14:26:37,890 --> 14:26:39,450
Eric Wulff: compare it with other
836
14:26:40,090 --> 14:26:46,880
Eric Wulff: fields, like image recognition or natural language processing, then this is really a small model.
837
14:26:47,030 --> 14:26:48,389
Eric Wulff: Um!
838
14:26:48,570 --> 14:26:50,480
Eric Wulff: However, we think that,
839
14:26:50,580 --> 14:26:52,679
Eric Wulff: I I actually don't know the
840
14:26:53,190 --> 14:26:57,809
Eric Wulff: the memory requirements that we would have to
841
14:26:57,850 --> 14:27:05,289
Eric Wulff: adhere to, if this would go into production at some point in the future. But I don't think we could go much larger,
842
14:27:05,410 --> 14:27:19,759
Eric Wulff: at least not without doing some kind of quantization, quantization-aware training or post-training quantization, or perhaps pruning weights after training, or doing some other tricks like that.
843
14:27:19,990 --> 14:27:23,109
Eric Wulff: uh data set size. So the
844
14:27:23,680 --> 14:27:26,389
Eric Wulff: the one we are currently using.
845
14:27:30,540 --> 14:27:34,559
Eric Wulff: I think it's around four hundred thousand events
846
14:27:35,000 --> 14:27:38,260
Eric Wulff: collision events of different kinds.
847
14:27:40,140 --> 14:27:44,790
Shigeki: Do you have an approximate idea of how much actual gigabytes that is?
848
14:27:45,140 --> 14:27:46,559
Eric Wulff: Um
849
14:27:47,210 --> 14:27:48,730
Shigeki: auto-
850
14:27:49,250 --> 14:27:51,920
Eric Wulff: it's a few hundred gigabytes,
851
14:27:52,100 --> 14:27:54,480
Eric Wulff: less than a thousand,
852
14:27:55,530 --> 14:28:08,920
Shigeki: And presumably, when you're running this, it's compute-bound, not I/O-bound, in terms of feeding the training data,
853
14:28:08,950 --> 14:28:11,229
Shigeki: or it depends.
854
14:28:11,450 --> 14:28:18,439
Eric Wulff: No, I would say it's compute-bound. Looking at the GPU utilization, it goes
855
14:28:18,590 --> 14:28:20,070
Eric Wulff: close to one hundred percent.
856
14:28:20,139 --> 14:28:22,229
Shigeki: Mhm Okay, thanks.
857
14:28:22,559 --> 14:28:27,009
Enrico Fermi Institute: And do you know how much of the memory on the GPU you're using?
858
14:28:27,570 --> 14:28:30,279
Eric Wulff: Uh, yes, we uh we,
859
14:28:30,400 --> 14:28:33,209
Eric Wulff: we use all of it, basically.
860
14:28:34,049 --> 14:28:40,529
Enrico Fermi Institute: So then it would not help you to have centers that chop up these big GPUs.
861
14:28:41,969 --> 14:28:45,769
Eric Wulff: I don't think so. There is a problem,
862
14:28:45,930 --> 14:28:57,160
Eric Wulff: sometimes, with having too large batch sizes. Basically, in order to fill up the GPU, you increase the batch size that you use for training,
863
14:28:57,230 --> 14:28:58,449
Eric Wulff: Um,
864
14:28:59,530 --> 14:29:05,829
Eric Wulff: and that means you can push more data
865
14:29:05,850 --> 14:29:14,719
Eric Wulff: through per time unit, but it doesn't necessarily mean you can do more optimization steps. So you might not
866
14:29:14,879 --> 14:29:17,020
Eric Wulff: uh reach
867
14:29:17,160 --> 14:29:20,090
Eric Wulff: the same accuracy quicker.
868
14:29:26,629 --> 14:29:38,190
Eric Wulff: It's not obvious that it's always the case that you can just throw more memory at it and it helps. Yeah, I was actually thinking of it the other way:
869
14:29:38,990 --> 14:29:45,470
Enrico Fermi Institute: we have a question in our data center of how much we should chop up the A100s using MIG,
870
14:29:47,480 --> 14:29:50,440
Enrico Fermi Institute: you know, give a person a whole
871
14:29:51,010 --> 14:29:54,830
Enrico Fermi Institute: eighty gigs, or split it up two ways or four ways
872
14:29:55,139 --> 14:30:03,550
Eric Wulff: to serve several users at the same time.
873
14:30:05,549 --> 14:30:06,580
Enrico Fermi Institute: Thanks.
874
14:30:07,530 --> 14:30:09,519
Enrico Fermi Institute: Show another comment:
875
14:30:12,860 --> 14:30:17,950
Enrico Fermi Institute: Sorry I got to the
876
14:30:18,650 --> 14:30:27,329
Dirk: Yeah, I had a question, and it's not so much, I mean, Eric, if you know, you can answer, but it's more looking at the broader,
877
14:30:27,559 --> 14:30:38,899
Dirk: the broader impact of that, and the follow-on, because this is interesting. But what's the next step? Have there been any discussions how
878
14:30:38,969 --> 14:30:41,610
Dirk: to integrate this in like?
879
14:30:41,700 --> 14:30:58,269
Dirk: Eventually? You said it's improving particle flow, so eventually it should feed back into how we run the reconstruction, basically. And then the question comes: how would you actually deploy this? How often do you have to run it?
880
14:30:58,540 --> 14:31:19,770
Dirk: How long does it take? And how often do I have to renew it, basically, with new data, to check that the parameters are still okay? And it's not just a question about this specific thing. So these are like the larger questions. Maybe Lindsay, or I don't know if Mike is connected, if there have been any
881
14:31:19,780 --> 14:31:26,789
Dirk: discussions of that already, or if that's still to come after the initial R&D is done.
882
14:31:30,130 --> 14:31:33,150
Eric Wulff: Well, I would say, if uh,
883
14:31:33,470 --> 14:31:36,980
Eric Wulff: if we are able to prove, or
884
14:31:37,030 --> 14:31:38,920
Eric Wulff: somehow show, that
885
14:31:39,020 --> 14:31:43,090
Eric Wulff: this machine learned approach to particle flow works
886
14:31:43,170 --> 14:31:44,490
Eric Wulff: uh
887
14:31:44,880 --> 14:31:52,579
Eric Wulff: as well but more efficiently, or even better than the
888
14:31:52,610 --> 14:31:54,660
Eric Wulff: methods that are used at the moment.
889
14:31:55,670 --> 14:31:59,449
Eric Wulff: Um, then we sort of freeze that model and
890
14:31:59,690 --> 14:32:04,779
Eric Wulff: get it into production, and then we shouldn't need to redo any hyper-
891
14:32:04,820 --> 14:32:34,339
Dirk: parameter optimization or anything like that. Then, you know, it's like having a finished algorithm that just... Yeah, but during data taking the detector changes all the time. So who knows if the training you did on 2022 data, or even Run 2 data, is still valid for your next set of data. Right. So we're not training on data, we're training on simulation. Okay, right. But I think when we talk about these kinds of problems, one of the things that needs to be studied
892
14:32:34,580 --> 14:32:44,590
Ian Fisk: is how stable these are, because it could be that we're incredibly lucky, and once you do the hyperparameter optimization it's applicable to
893
14:32:45,180 --> 14:32:51,009
Ian Fisk: small changes in data. Um, and one thing I think we can see from Eric's plots is that
894
14:32:51,050 --> 14:33:01,189
Ian Fisk: it makes these things faster. They train faster and better after they're optimized. And so, unless we were just unreasonably lucky, they'll actually save us resources.
895
14:33:02,360 --> 14:33:03,300
Okay,
896
14:33:03,500 --> 14:33:08,860
Dirk: Okay. But it sounds like It's a discussion that's still to come. That's not. We're not quite there yet.
897
14:33:09,400 --> 14:33:25,109
Ian Fisk: Well, I think so. I think, given how much this improves the situation, chances are (and I think this applies to multiple science fields, not just ourselves) that we should be factoring these things into our discussion about how we're going to use HPC
898
14:33:25,140 --> 14:33:35,829
Ian Fisk: um, for the report. And then we'll have to wait and see whether this is a workflow that we're constantly running, or one that we run once in a while.
899
14:33:39,190 --> 14:33:47,179
Mike Hildreth: Yeah, I guess I would agree with that. Um, we don't have enough data yet to say
900
14:33:47,840 --> 14:33:53,670
Mike Hildreth: how often we're going to have to retrain these. But this use case is certainly in the planning.
901
14:33:54,080 --> 14:33:55,760
Enrico Fermi Institute: Is it right?
902
14:33:55,850 --> 14:34:07,809
Enrico Fermi Institute: I think the one remaining worry is, we haven't been through a complete recalibration cycle of the detector after a stop, or anything like that,
903
14:34:07,820 --> 14:34:21,400
Enrico Fermi Institute: to see how robust a single training is, or the most optimal training is, with respect to the changing parameters of the detector. It's just something we have to find out, but it's not going to change the picture all that much, to be honest.
904
14:34:21,410 --> 14:34:28,360
Enrico Fermi Institute: But yeah, I agree with Ian here. This is probably going to save us resources as well in the long run.
905
14:34:28,620 --> 14:34:30,320
Dirk: Okay, thanks.
906
14:34:30,510 --> 14:34:38,550
Dirk: That makes it difficult for us to write because we can write the use case in, but it's extremely hard to attach any numbers to it at the moment.
907
14:34:41,470 --> 14:34:55,099
Enrico Fermi Institute: Yeah, I guess another way to summarize it: we've shown that this works, and that we can get really great results out of it, but we haven't understood the true steady-state operational parameters of this system.
908
14:34:59,230 --> 14:35:04,370
Eric Wulff: And just to be clear, there still needs to be
909
14:35:04,610 --> 14:35:08,699
Eric Wulff: quite a bit of work before this would be ready to go into production.
910
14:35:09,140 --> 14:35:10,600
Eric Wulff: It's still
911
14:35:10,880 --> 14:35:14,050
Eric Wulff: uh, like we don't understand
912
14:35:14,200 --> 14:35:18,509
Eric Wulff: all the properties of how it reconstructs particles well enough. Yet,
913
14:35:20,650 --> 14:35:23,980
Eric Wulff: although you know it's looking good, it's. It's looking promising,
914
14:35:24,230 --> 14:35:30,350
Eric Wulff: but it needs to be validated much more before production.
915
14:35:41,060 --> 14:35:44,129
Enrico Fermi Institute: So, do we have more questions?
916
14:35:46,660 --> 14:35:50,649
Enrico Fermi Institute: I guess one silly question
917
14:35:51,140 --> 14:36:03,900
Enrico Fermi Institute: in terms of actually trying to use this, like in CMSSW, and this is mostly because I don't remember the last time that Joseph presented this: how fast does this go per event in inference mode?
918
14:36:04,220 --> 14:36:06,810
Enrico Fermi Institute: I mean, what does the throughput look like?
919
14:36:06,940 --> 14:36:24,380
Eric Wulff: Um, I don't think we have done anything there that would be comparable to production. Or maybe an even better question is, what does the memory footprint look like on GPU or CPU?
920
14:36:24,770 --> 14:36:31,000
Eric Wulff: uh, I don't know that off the top of my head, but I know we have a plot somewhere that I can
921
14:36:31,100 --> 14:36:32,899
Enrico Fermi Institute: all good. Thank you.
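The per-event throughput being asked about can be estimated with a simple timing loop. This is a hypothetical sketch, not the actual MLPF benchmark; `run_inference` is a placeholder for the real model's forward pass, and the dummy event data stands in for real inputs:

```python
# Hypothetical harness for estimating per-event inference throughput.
# `run_inference` is a stand-in for the real (unspecified) model call.
import time

def run_inference(event):
    # placeholder: the real call would run the trained model on one event
    return sum(event)

events = [[float(i)] * 100 for i in range(2000)]  # dummy event data

start = time.perf_counter()
for ev in events:
    run_inference(ev)
elapsed = time.perf_counter() - start

ms_per_event = 1e3 * elapsed / len(events)
events_per_s = len(events) / elapsed
print(f"{ms_per_event:.4f} ms/event ({events_per_s:.0f} events/s)")
```

The memory footprint mentioned next would need a separate measurement, for example with a memory profiler, rather than a timing loop like this one.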
922
14:36:37,540 --> 14:36:46,069
Enrico Fermi Institute: Okay, if there are no other questions, then we can move on. Thank you very much for the presentation, Eric.
923
14:36:46,360 --> 14:36:48,119
Eric Wulff: No problem. Thanks for listening.