[Eastern Time]
CERN presentation
615
14:00:29,710 --> 14:00:34,679
Enrico Fermi Institute: I think this is the last session that's focused exclusively on cloud.
616
14:00:36,900 --> 14:00:37,920
Yeah.
617
14:00:38,670 --> 14:00:44,219
Enrico Fermi Institute: In the next session we'll talk about some R&D things and networking. So
618
14:00:52,660 --> 14:00:57,720
Enrico Fermi Institute: okay, so maybe we break here, and we'll see everybody at one o'clock Central Time,
619
14:00:58,540 --> 14:01:00,130
Fernando Harald Barreiro Megino: so you know
620
14:01:01,310 --> 14:01:02,620
Enrico Fermi Institute: machine learning,
621
14:01:03,820 --> 14:01:09,699
Enrico Fermi Institute: and then we'll go back to the topics as presented in the slides,
622
14:01:10,610 --> 14:01:12,850
Enrico Fermi Institute: so we'll just get started in a few minutes here,
623
14:01:53,380 --> 14:01:56,370
Maria Girone: So it's Eric starting first, right?
624
14:01:56,540 --> 14:02:04,520
Maria Girone: Yeah, if Eric is ready to present, we thought maybe it would be best to just have him.
625
14:02:07,990 --> 14:02:16,680
Enrico Fermi Institute: It's getting a little bit late, I'm concerned. Yeah, exactly. We want to be considerate of people's time, in Europe especially. Thank you.
626
14:02:29,590 --> 14:02:39,579
Enrico Fermi Institute: So just give it like two more minutes, and then, Eric, whenever you're ready, you know, put your slides up. I'll stop sharing here when we get started shortly.
627
14:02:42,740 --> 14:02:47,450
Eric Wulff: Sounds good. I'm ready whenever, so just let me know. Okay,
630
14:03:09,350 --> 14:03:17,650
Enrico Fermi Institute: It seems like the rate at which people are rejoining has slowed down significantly, so I think you can go ahead and start.
631
14:03:22,080 --> 14:03:23,529
Eric Wulff: uh, Okay.
632
14:03:24,610 --> 14:03:25,870
Eric Wulff: So
633
14:03:27,290 --> 14:03:31,050
Eric Wulff: I'm sharing now, I think. Can you see?
634
14:03:31,340 --> 14:03:33,999
Eric Wulff: Yes, it looks good. Okay, great.
635
14:03:34,560 --> 14:03:37,929
Eric Wulff: Um, so I just have
636
14:03:38,180 --> 14:03:52,689
Eric Wulff: two or three slides here, so it's a very short presentation, just to talk a little bit about what we have been doing regarding distributed training and hyperparameter tuning of deep-learning-based algorithms using, you know, HPC computing.
637
14:03:53,360 --> 14:04:00,499
Eric Wulff: So this is something that I have been doing in the context of an EU-funded research project called CoE RAISE.
638
14:04:06,260 --> 14:04:08,620
Eric Wulff: involved in this, and she's my supervisor.
639
14:04:09,580 --> 14:04:10,969
Um.
640
14:04:12,850 --> 14:04:15,450
So let's see if I can change slide.
641
14:04:15,770 --> 14:04:17,940
Eric Wulff: Yes, um.
642
14:04:18,590 --> 14:04:24,429
Eric Wulff: So, just in case you're not aware of hyperparameter optimization:
643
14:04:25,320 --> 14:04:35,079
Eric Wulff: so if you're not aware of what that is, I've tried to summarize it very quickly here in just one slide. I will sometimes refer to it as hypertuning,
644
14:04:35,140 --> 14:04:36,670
Eric Wulff: and um,
645
14:04:36,730 --> 14:04:39,300
Eric Wulff: it's basically to um
646
14:04:39,340 --> 14:04:49,350
Eric Wulff: to tune the hyperparameters of an AI model or a deep learning model, and hyperparameters are simply the settings of the model.
647
14:04:58,840 --> 14:05:09,139
Eric Wulff: And they can define things like the model architecture. So, for instance, how many layers you have in your neural network, how many nodes you have in each layer, and so on.
648
14:05:09,520 --> 14:05:19,239
Eric Wulff: But they also define things that have to do with the optimization of the model, such as the learning rate, the batch size, and so forth.
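[Editor's illustration. The kind of search space being described might look like the following sketch; the names and ranges here are invented for the example, not taken from the MLPF setup.]

```python
import random

# Illustrative hyperparameter search space: architecture choices plus
# optimization settings. Names and ranges are invented for this sketch.
SEARCH_SPACE = {
    "num_layers":      lambda: random.randint(2, 8),           # architecture
    "nodes_per_layer": lambda: random.choice([64, 128, 256]),  # architecture
    "learning_rate":   lambda: 10 ** random.uniform(-5, -2),   # optimization (log-uniform)
    "batch_size":      lambda: random.choice([32, 64, 128, 256]),
}

def sample_config():
    """Draw one hyperparameter configuration, i.e. one 'trial' to train."""
    return {name: draw() for name, draw in SEARCH_SPACE.items()}

print(sample_config())
```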
649
14:05:19,720 --> 14:05:20,570
Yeah.
650
14:05:22,180 --> 14:05:28,950
Eric Wulff: Now, if you have a large model, or a very complex model, which requires a lot of compute
651
14:05:29,220 --> 14:05:30,469
Eric Wulff: and
652
14:05:31,480 --> 14:05:33,510
Eric Wulff: uh, to do the forward pass,
653
14:05:33,610 --> 14:05:34,950
Eric Wulff: and
654
14:05:35,630 --> 14:05:38,329
Eric Wulff: and/or you have a large data set,
655
14:05:38,360 --> 14:05:41,660
Eric Wulff: hypertuning can be extremely
656
14:05:41,940 --> 14:05:56,630
Eric Wulff: compute-resource intensive. Therefore it can benefit greatly from HPC resources. And furthermore, we need smart and efficient search algorithms to find good hyperparameters, so that we don't waste the HPC resources that we have.
657
14:05:59,290 --> 14:06:00,480
Eric Wulff: um.
658
14:06:01,000 --> 14:06:10,500
Eric Wulff: So in RAISE, I have been working with a group working on machine-learned particle flow, which is
659
14:06:10,810 --> 14:06:13,939
Eric Wulff: a collaboration with CMS,
660
14:06:14,080 --> 14:06:17,230
Eric Wulff: with people from CMS. And
661
14:06:17,420 --> 14:06:19,599
Eric Wulff: in order to hypertune this model,
662
14:06:19,690 --> 14:06:25,310
Eric Wulff: in RAISE we have been using an open-source framework called Ray Tune,
663
14:06:25,750 --> 14:06:34,059
Eric Wulff: which allows us to run many different trials in parallel, using multiple GPUs per trial,
664
14:06:34,270 --> 14:06:39,010
Eric Wulff: which is what this picture up here is trying to represent.
665
14:06:39,570 --> 14:06:40,990
Eric Wulff: And
666
14:06:42,990 --> 14:06:51,389
Eric Wulff: Now, with Ray Tune we can also get a very nice overview of the different trials, and we can pick the one that we see performs the best.
667
14:06:51,580 --> 14:06:57,289
Eric Wulff: And Ray Tune also has a lot of different search algorithms that
668
14:06:57,660 --> 14:07:01,359
Eric Wulff: help us find the right
669
14:07:01,690 --> 14:07:02,970
Eric Wulff: I, the parameters.
670
14:07:03,430 --> 14:07:18,949
Eric Wulff: And here, to the right, we have an example of the kind of difference this can make to the learning of the model. So here we have plotted the training and validation losses before and after hypertuning.
671
14:07:20,620 --> 14:07:32,120
Eric Wulff: So, as you can see here, the loss went down quite a bit after hypertuning, almost by a factor of two, and furthermore, the training seems to be much more stable. We have
672
14:07:32,380 --> 14:07:36,559
Eric Wulff: these bands, which represent the standard deviation
673
14:07:36,750 --> 14:07:42,170
Eric Wulff: between different trainings. It's much more stable in the right plot.
674
14:07:47,030 --> 14:07:56,090
Eric Wulff: And I just had one more slide here to sort of illustrate how useful high-performance computing can be in order to speed up
675
14:07:56,810 --> 14:07:58,380
hyperparameter optimization.
676
14:07:58,560 --> 14:08:03,430
Eric Wulff: So this just shows the scaling from four to twenty-four
677
14:08:03,680 --> 14:08:05,309
Eric Wulff: compute nodes.
678
14:08:05,330 --> 14:08:06,550
Eric Wulff: Um,
679
14:08:06,990 --> 14:08:15,439
Eric Wulff: Maybe particularly looking at the plot to the right here, we can see that the scaling for this use case is actually better than linear,
680
14:08:15,570 --> 14:08:20,269
Eric Wulff: which at least in part has to do with
681
14:08:20,820 --> 14:08:26,109
Eric Wulff: some excessive reloading of models that happens when we have few nodes.
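[Editor's illustration. The superlinear scaling can be made concrete with a toy cost model; all numbers below are invented. With a fixed pool of trials, fewer nodes means each node time-shares more trials, and every swap costs a model reload, so per-epoch overhead grows as the node count shrinks.]

```python
import math

TRIALS = 48          # concurrent trials the search wants to evaluate (invented)
EPOCH_TIME = 60.0    # minutes of pure training per trial per epoch (invented)
RELOAD_TIME = 5.0    # minutes to checkpoint and reload a swapped trial (invented)

def time_per_epoch(nodes):
    """Minutes to advance all trials by one epoch on `nodes` nodes."""
    trials_per_node = math.ceil(TRIALS / nodes)
    # each extra trial sharing a node forces one reload per epoch
    reloads = trials_per_node - 1
    return trials_per_node * EPOCH_TIME + reloads * RELOAD_TIME

speedup = time_per_epoch(4) / time_per_epoch(24)
print(speedup, 24 / 4)  # the measured speedup exceeds the 6x node ratio
```

Under these assumptions, going from 4 to 24 nodes gives a speedup slightly above 6x, because the reload overhead shrinks along with the time-sharing, which is the mechanism described above.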
682
14:08:28,060 --> 14:08:29,150
Eric Wulff: Um.
683
14:08:31,070 --> 14:08:35,830
Eric Wulff: So, well, this basically means that the more
684
14:08:36,030 --> 14:08:41,099
Eric Wulff: nodes we have, the more GPUs we have, the faster we can tune and optimize these models.
685
14:08:41,670 --> 14:08:47,480
Eric Wulff: That's all I had for for this.
686
14:08:48,740 --> 14:08:58,029
Enrico Fermi Institute: Can you tell a priori, from the model, that the model you're using will
687
14:08:58,080 --> 14:09:04,340
Enrico Fermi Institute: show this sort of behavior, so that if someone comes with any given model, you know how to sort of shape the work,
688
14:09:06,550 --> 14:09:15,609
Enrico Fermi Institute: if you understand what I mean. No? What I mean is, you discovered that you get better than linear scaling with this training.
689
14:09:15,700 --> 14:09:16,719
Right?
690
14:09:17,160 --> 14:09:22,499
Enrico Fermi Institute: Is that always the case, or is that the case with any given model?
691
14:09:23,150 --> 14:09:24,459
Um,
692
14:09:25,150 --> 14:09:33,199
Eric Wulff: Yeah, I think, so, this is showing the scaling of the hyperparameter optimization itself.
693
14:09:33,650 --> 14:09:40,180
Eric Wulff: So it's not, if you had just a single training, it wouldn't scale like this; it would be
694
14:09:40,360 --> 14:09:42,610
Eric Wulff: a bit worse than linear, probably.
695
14:09:45,610 --> 14:09:51,289
Eric Wulff: So the way that the hypertuning works in this case is that we
696
14:09:51,430 --> 14:09:53,199
Eric Wulff: launch a bunch of
697
14:09:53,690 --> 14:09:56,980
Eric Wulff: trials in parallel with different hyperparameter
698
14:09:57,010 --> 14:09:58,559
Eric Wulff: configurations.
699
14:09:58,990 --> 14:10:00,189
Eric Wulff: And then
700
14:10:00,340 --> 14:10:01,780
Eric Wulff: um!
701
14:10:02,230 --> 14:10:10,820
Eric Wulff: There is a sort of scheduling or search algorithm looking at how well all these trials perform,
702
14:10:10,940 --> 14:10:22,829
Eric Wulff: and then it terminates the ones that look less promising and continues training the ones that look promising. And then we can also have some kind of Bayesian optimization
703
14:10:23,190 --> 14:10:26,360
Eric Wulff: component here, which tries to predict which
704
14:10:27,470 --> 14:10:31,230
Eric Wulff: hyperparameters would perform well, and then we try those next.
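[Editor's illustration. The pause-and-terminate scheduling described here is in the spirit of successive halving, which Ray Tune's ASHA-style schedulers automate. The following deliberately simplified pure-Python sketch, not the Ray Tune API, trains all trials to a common synchronization epoch, keeps the more promising half, and backfills with fresh configurations; `fake_train` is an invented stand-in for a real training run.]

```python
import random

def fake_train(config, epochs):
    """Stand-in for a real training run: returns a loss that improves
    with epochs and depends on an invented learning-rate setting."""
    lr = config["learning_rate"]
    return abs(lr - 3e-4) * 100 + 1.0 / (1 + epochs)

def successive_halving(num_trials=8, sync_epochs=5, rounds=3):
    trials = [{"learning_rate": 10 ** random.uniform(-5, -2)}
              for _ in range(num_trials)]
    for _ in range(rounds):
        # train every surviving trial up to the same epoch number...
        scored = sorted(trials, key=lambda c: fake_train(c, sync_epochs))
        # ...terminate the less promising half...
        survivors = scored[: max(1, len(scored) // 2)]
        # ...and start fresh configurations in place of the terminated ones
        fresh = [{"learning_rate": 10 ** random.uniform(-5, -2)}
                 for _ in range(len(scored) - len(survivors))]
        trials = survivors + fresh
    return min(trials, key=lambda c: fake_train(c, sync_epochs))

best = successive_halving()
print(best["learning_rate"])
```

A Bayesian search component, as mentioned in the talk, would replace the random backfill with configurations proposed from a model of past trial results.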
705
14:10:32,930 --> 14:10:39,059
Enrico Fermi Institute: And if you were to double or triple the number of nodes, would it continue, or
706
14:10:39,310 --> 14:10:42,929
Enrico Fermi Institute: does the actual growth begin to flatten out?
707
14:10:43,430 --> 14:11:00,910
Eric Wulff: Um, I haven't tested this with more than up to twenty-four nodes, so I can't say for sure, but I imagine it will continue for at least a bit more. But I can't say for how long, and
708
14:11:01,060 --> 14:11:16,039
Enrico Fermi Institute: I would also assume that eventually it would flatten off.
709
14:11:17,080 --> 14:11:18,540
Eric Wulff: Um,
710
14:11:19,510 --> 14:11:23,909
Enrico Fermi Institute: Yeah. Is the issue resource contention?
711
14:11:24,600 --> 14:11:30,520
Eric Wulff: Yeah, it has to do with the search algorithm, which
712
14:11:30,630 --> 14:11:32,309
Eric Wulff: um
713
14:11:33,180 --> 14:11:39,990
Eric Wulff: trains a few trials and then terminates bad ones, and then continues with new ones. So
714
14:11:40,360 --> 14:11:48,789
Eric Wulff: if you have more trials than you have nodes that you want to run, you have to sort of
715
14:11:49,280 --> 14:11:54,179
Eric Wulff: pause trials at some point, and start training other ones.
716
14:11:54,590 --> 14:11:56,110
Eric Wulff: Um!
717
14:11:56,270 --> 14:12:02,699
Eric Wulff: Because you need to train all the trials up to the same epoch number before you decide which ones to keep and which not.
718
14:12:04,140 --> 14:12:11,450
Eric Wulff: So it doesn't have to do with Ray Tune per se. It just has to do with the particular search algorithm, or,
719
14:12:11,530 --> 14:12:15,219
Eric Wulff: a lot of search algorithms actually work like that.
720
14:12:18,070 --> 14:12:19,019
Yeah,
721
14:12:19,250 --> 14:12:21,929
Enrico Fermi Institute: You have a question or comment in the chat.
722
14:12:22,100 --> 14:12:40,870
Ian Fisk: Yeah, I had a question for Eric, and maybe it's too early to tell, but my question was, how stable do you expect the hyperparameter tuning to be, in the sense that: are we expecting that every time we change the network or get new data, we're going to have to re-optimize the hyperparameters? Or is this something that,
723
14:12:40,880 --> 14:12:50,119
Ian Fisk: once we sort of optimize for a particular problem, we may find that those are stable over periods of time? The reason I ask is that this seems like,
724
14:12:50,620 --> 14:12:59,900
Ian Fisk: when we talk about the use of HPC or clouds and specialized resources, training is a big part of how we tend to use them. But the hyperparameter
725
14:13:00,190 --> 14:13:11,330
Ian Fisk: optimization sort of increases that by a factor of fifty or so. And so, if we have to do it each time, we probably need to factor those things into our thoughts about where we're constrained on resources.
726
14:13:12,110 --> 14:13:14,099
Eric Wulff: Yeah, so
727
14:13:14,770 --> 14:13:16,039
Eric Wulff: um,
728
14:13:16,760 --> 14:13:23,389
Eric Wulff: it would completely depend on how much you change your model, or how much you change the problem.
729
14:13:23,470 --> 14:13:24,989
Eric Wulff: I mean, if you're
730
14:13:25,010 --> 14:13:27,139
Eric Wulff: if you change your model
731
14:13:27,180 --> 14:13:32,739
Eric Wulff: architecture, you will probably have to run a new hyperparameter optimization,
732
14:13:32,770 --> 14:13:38,310
Eric Wulff: because you might not even have the same hyperparameters in your model anymore.
733
14:13:38,550 --> 14:13:40,150
Eric Wulff: Uh,
734
14:13:40,610 --> 14:13:56,560
Eric Wulff: But, you know, if things aren't too different, you might not have to hypertune, or maybe just do a small hypertuning, you know, just a few parameters in some narrow or small search space.
735
14:13:56,690 --> 14:13:58,640
Eric Wulff: So, for instance,
736
14:13:59,020 --> 14:14:00,809
Eric Wulff: if you look at other
737
14:14:00,840 --> 14:14:01,950
Eric Wulff: uh
738
14:14:02,920 --> 14:14:06,280
Eric Wulff: other fields, such as, for instance,
739
14:14:06,390 --> 14:14:09,070
Eric Wulff: image recognition or object detection.
740
14:14:09,210 --> 14:14:26,879
Eric Wulff: If you find a network that performs well on, you know, classifying certain kinds of objects, then it's very likely that, using the same hyperparameters, it would be good at classifying other kinds of objects as well, if you just have labeled data for those objects.
741
14:14:26,890 --> 14:14:29,329
So in that case, probably you wouldn't have to
742
14:14:31,100 --> 14:14:33,510
Eric Wulff: run a full hyperparameter optimization again.
743
14:14:37,260 --> 14:14:46,599
Ian Fisk: Thanks. It's very impressive, the amount that it improves the situation by doing this separately. Getting a factor of two is nice.
744
14:14:48,460 --> 14:14:49,360
Eric Wulff: Thanks.
745
14:14:50,810 --> 14:15:07,050
Paolo Calafiura (he): A question or comment from Paolo. Yes, I missed the first couple of minutes, sorry, so I don't know if the question was addressed there. My question is: here you're starting to show the scaling at four nodes,
746
14:15:07,060 --> 14:15:13,339
Paolo Calafiura (he): and I wonder what the scaling would look like if you compared it with a single node, or a single GPU.
747
14:15:14,870 --> 14:15:16,540
Eric Wulff: Um.
748
14:15:26,890 --> 14:15:32,669
Eric Wulff: The fewer nodes you have, the more of this excessive reloading has to happen.
749
14:15:32,930 --> 14:15:37,320
Eric Wulff: So just using one node would be very, very slow.
750
14:15:37,510 --> 14:15:50,440
Paolo Calafiura (he): But is that because of the way Ray Tune does this business, or because of the search algorithm we use? So it's not Ray Tune per se, it's the,
751
14:15:51,360 --> 14:15:58,859
Eric Wulff: it's because of the algorithm. You wouldn't be able to run this faster with another framework. Well, I mean,
752
14:15:59,760 --> 14:16:18,139
Paolo Calafiura (he): it's the algorithm's problem, not Ray Tune's. So it's a little bit harder then to do the comparison. I mean, I'm thinking, if you used something like scikit-optimize on a single GPU to do the same thing. And then, of course, there is the question: what is the,
753
14:16:22,910 --> 14:16:26,699
Paolo Calafiura (he): Okay, it's a complicated question.
754
14:16:29,870 --> 14:16:32,029
Okay? Next we have
755
14:16:34,400 --> 14:16:45,700
Shigeki: Uh, yeah, I'm gonna show my ignorance here, just trying to understand exactly how this works. I think I'm on the first slide, second slide.
756
14:16:45,730 --> 14:16:54,140
Shigeki: You show trial one, trial two, trial three, and those trials are independent of each other, right? They're all working on,
757
14:16:54,440 --> 14:17:12,849
Shigeki: Okay. The next thing here is that presumably they're reading the same set of data over, in order to train, but they're completely independent in terms of where they are in the input stream, right? They're not, like, working in lockstep or anything.
758
14:17:13,630 --> 14:17:25,690
Eric Wulff: It depends on the kind of search algorithm that you use, the hyperparameter search algorithm. So, um,
759
14:17:26,590 --> 14:17:27,650
Eric Wulff: in um.
760
14:17:28,350 --> 14:17:40,270
Eric Wulff: Well, to begin with, you can choose not to use any particular search algorithm, and then everything is just done in parallel, sort of.
761
14:17:40,560 --> 14:17:41,710
Eric Wulff: however,
762
14:17:42,000 --> 14:17:53,250
Eric Wulff: it's much more efficient to use some kind of search algorithm. So then you would want to train all the trials up to a certain
763
14:17:53,570 --> 14:17:58,200
Eric Wulff: epoch number. Let's say you train them all up to epoch five, and then you look at,
764
14:17:58,530 --> 14:18:08,800
Eric Wulff: you have some algorithm that decides which ones to terminate and which ones to continue training, and in place of the ones you terminated, you start new trials
765
14:18:08,820 --> 14:18:12,450
Eric Wulff: with new hyperparameter configurations.
766
14:18:12,500 --> 14:18:19,529
Eric Wulff: So then, if you have many more trials than you have compute nodes, you have to
767
14:18:19,720 --> 14:18:27,839
Eric Wulff: pause some trials at epoch five, and then load in new trials and train them up until epoch five.
768
14:18:28,230 --> 14:18:30,749
Shigeki: Okay. So
769
14:18:31,070 --> 14:18:35,280
Shigeki: Okay. But to a certain extent, though, the trials are running independently,
770
14:18:35,290 --> 14:18:51,889
Shigeki: and they get synchronized at some point by the epoch that you're stopping at. But other than that, up to that epoch point, they're blasting through the data as quickly as they can. And so they're not in sync. Okay,
771
14:18:52,640 --> 14:18:53,690
Shigeki: thank you.
772
14:18:56,430 --> 14:18:59,330
Enrico Fermi Institute: So how long does it take to run this on,
773
14:18:59,370 --> 14:19:07,800
Enrico Fermi Institute: you know, for one node? How long is it running the hyperparameter optimization, in terms of wall-time hours?
774
14:19:08,120 --> 14:19:09,599
Eric Wulff: Um!
775
14:19:10,010 --> 14:19:11,059
Eric Wulff: So
776
14:19:11,130 --> 14:19:21,010
Eric Wulff: That can vary a lot, depending on how large your search space is, and the model we use, and the data that we use, and so on. I think for the results I show here,
777
14:19:21,310 --> 14:19:22,860
Eric Wulff: Um
778
14:19:23,820 --> 14:19:26,859
Eric Wulff: uh, If I remember correctly,
779
14:19:27,120 --> 14:19:33,029
Eric Wulff: the whole thing took around eighty hours
780
14:19:33,190 --> 14:19:35,740
Eric Wulff: in wall time,
781
14:19:35,980 --> 14:19:40,909
Eric Wulff: and that was using twelve
782
14:19:40,930 --> 14:19:45,800
Eric Wulff: compute nodes with four GPUs each.
783
14:19:45,810 --> 14:20:11,110
Enrico Fermi Institute: That can be, you know, trivially broken up into multiple jobs and things like that? The reason I ask is, one of the things I notice is that on some of the HPCs, at least in the US, they have maximum wall times for your jobs in the queues, right? So I'm looking at Perlmutter right now, and it says you can have a GPU job in the regular queue for twelve hours at most.
784
14:20:11,120 --> 14:20:15,659
Enrico Fermi Institute: And so I'm wondering, like, what useful work can we get done, for
785
14:20:15,870 --> 14:20:25,280
Enrico Fermi Institute: you know, hyperparameter optimization or machine learning in general, given the relatively short maximum wall time?
786
14:20:25,450 --> 14:20:29,280
Eric Wulff: Um. So one solution is to uh
787
14:20:29,460 --> 14:20:31,290
Eric Wulff: checkpoint
788
14:20:31,950 --> 14:20:39,149
Eric Wulff: the search, and then just launch it again and continue where you left off. We're able to do that. So
789
14:20:39,190 --> 14:20:44,300
Eric Wulff: we are saving checkpoints regularly throughout the workload.
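[Editor's illustration. Ray Tune supports resuming an interrupted experiment from checkpoints; as a minimal pure-Python sketch of the pattern, with the filename and state layout invented for the example, a walltime-limited job can save its state once per epoch and a follow-up job resumes where the last one stopped.]

```python
import json
import os

CKPT = "hpo_checkpoint.json"  # invented filename for this sketch

def load_state():
    """Resume from the last checkpoint if one exists, else start fresh."""
    if os.path.exists(CKPT):
        with open(CKPT) as f:
            return json.load(f)
    return {"epoch": 0, "best_loss": float("inf")}

def save_state(state):
    # write-then-rename so a killed job cannot leave a torn checkpoint
    tmp = CKPT + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, CKPT)

def run(max_epochs, walltime_epochs):
    """Train until the walltime budget runs out, checkpointing each epoch."""
    state = load_state()
    for _ in range(walltime_epochs):
        if state["epoch"] >= max_epochs:
            break
        state["epoch"] += 1
        state["best_loss"] = min(state["best_loss"], 1.0 / state["epoch"])
        save_state(state)  # once per epoch, as described in the talk
    return state

first = run(max_epochs=100, walltime_epochs=12)   # job 1 hits the walltime limit
second = run(max_epochs=100, walltime_epochs=12)  # job 2 resumes from the checkpoint
print(first["epoch"], second["epoch"])
```

Chaining such jobs is what makes an eighty-hour search fit inside a queue with a twelve-hour wall-time limit.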
790
14:20:45,570 --> 14:20:47,679
Eric Wulff: Okay? And uh, yeah,
791
14:20:47,820 --> 14:20:50,360
Enrico Fermi Institute: how often do you save the checkpoints?
792
14:20:51,280 --> 14:21:07,169
Eric Wulff: That's configurable, but usually once per epoch, so once per read-through of the data set.
793
14:21:08,020 --> 14:21:15,920
Eric Wulff: That depends a lot also, but let's say, well, between twelve and twenty-four hours.
794
14:21:17,110 --> 14:21:20,540
Eric Wulff: But this completely depends on how much data you have. And uh,
795
14:21:21,140 --> 14:21:24,060
Eric Wulff: you know, the particular model you use.
796
14:21:24,530 --> 14:21:41,880
Enrico Fermi Institute: That's an epoch for the hyperparameter optimization itself, not just a single instance of the neural network?
797
14:21:42,740 --> 14:21:45,710
twenty-four hours for a single,
798
14:21:46,740 --> 14:21:53,449
Eric Wulff: and that's um. So that you know we have quite a big data set. So that's
799
14:21:53,510 --> 14:22:00,430
Eric Wulff: why. But we're also using four A100 GPUs for that. So
800
14:22:00,820 --> 14:22:02,320
Eric Wulff: if you have a
801
14:22:02,640 --> 14:22:05,420
Eric Wulff: older GPUs, that would take much longer.
802
14:22:08,980 --> 14:22:19,460
Enrico Fermi Institute: I guess what I'm wondering is, you know, for the report, should we have some recommendation about the policies at these sites, you know, to
803
14:22:20,140 --> 14:22:25,540
Enrico Fermi Institute: allow much longer GPU jobs to run, to do these sorts of tasks?
804
14:22:26,090 --> 14:22:29,069
Eric Wulff: Well, my opinion is that it would be
805
14:22:29,720 --> 14:22:47,669
Enrico Fermi Institute: it would be convenient if we could. But you know, it's not deal-breaking, because we can checkpoint this and just reload, right? But, you just said your epochs are twelve to twenty-four hours, and Lincoln just said that
806
14:22:47,720 --> 14:22:57,990
Eric Wulff: twelve hours. So, sorry, sorry, I misspoke here, so
807
14:22:58,500 --> 14:23:13,459
Eric Wulff: apologies, it's a bit late over here. It takes twenty-four hours for a full training, not for one epoch.
808
14:23:13,470 --> 14:23:33,439
Enrico Fermi Institute: We're not asking for a policy change, right? Just a behavioral change with checkpointing. And you're saving at the end of each full training, or each epoch? Yeah, sorry. You have, like, two hundred epochs, is that right? It's probably in the plot.
809
14:23:33,650 --> 14:23:37,789
Eric Wulff: Uh: yeah, yeah, in the plot here. So Um:
810
14:23:38,030 --> 14:23:56,069
Eric Wulff: Yeah. So this is a plot from last year. Now we have a larger data set, and we train for about a hundred epochs, and that takes roughly twenty-four hours.
811
14:23:57,900 --> 14:23:59,820
Enrico Fermi Institute: Okay, Um,
812
14:24:00,170 --> 14:24:13,310
Enrico Fermi Institute: Yeah, would adding more GPUs per node help you in terms of number of epochs? Or do you have enough data to get reasonable convergence, at least with this model, after one hundred?
813
14:24:21,110 --> 14:24:22,430
Eric Wulff: actually we are.
814
14:24:22,690 --> 14:24:27,659
Eric Wulff: We just saw that if we scale up our model
815
14:24:27,690 --> 14:24:40,729
Eric Wulff: significantly, so make the model larger, with many more parameters, we can easily improve the physics performance. We just tried that
816
14:24:41,300 --> 14:24:44,330
Eric Wulff: this week,
817
14:24:44,660 --> 14:24:47,859
Eric Wulff: because we were curious, basically. However,
818
14:24:47,920 --> 14:24:49,790
Eric Wulff: that's sort of not a
819
14:24:58,390 --> 14:25:02,050
Eric Wulff: quickly enough in production, anyway.
820
14:25:02,590 --> 14:25:03,639
Eric Wulff: Um,
821
14:25:06,150 --> 14:25:08,350
Eric Wulff: but it sort of shows that the
822
14:25:08,440 --> 14:25:17,159
Eric Wulff: there is enough information in the data to do better. We just need to improve the model, or the training of the model, somehow.
823
14:25:20,160 --> 14:25:25,100
Enrico Fermi Institute: Okay. Shigeki, you have your hand raised.
824
14:25:25,830 --> 14:25:42,530
Shigeki: Uh, yeah, I just have a question in terms of the amount of data you're going through and the model size, which I guess is measured in terms of number of parameters as well as hyperparameters, and whether or not there is a
825
14:25:42,540 --> 14:25:54,120
Shigeki: size that physics problems in HEP tend to gravitate to, or can it be all over the map in terms of model size, data set size, and number of hyperparameters?
826
14:25:55,040 --> 14:25:56,179
Eric Wulff: Um!
827
14:25:56,320 --> 14:26:00,129
Eric Wulff: So, the number of hyperparameters,
828
14:26:00,190 --> 14:26:07,620
Eric Wulff: that's a little bit arbitrary, depending on what you mean by hyperparameters. So if you
829
14:26:08,040 --> 14:26:10,180
Eric Wulff: uh if you count
830
14:26:10,250 --> 14:26:11,389
Eric Wulff: well,
831
14:26:11,430 --> 14:26:13,889
Eric Wulff: you you you can configure
832
14:26:14,040 --> 14:26:23,330
Eric Wulff: very many things with our model. So if you count all those hyperparameters, I don't know how many they are, but there are hundreds, and we don't tune all of them, because they're too many.
833
14:26:28,100 --> 14:26:33,720
Eric Wulff: The number of trainable parameters in the model is around one million,
834
14:26:34,130 --> 14:26:37,850
Eric Wulff: so that's fairly small, if you
835
14:26:37,890 --> 14:26:39,450
Eric Wulff: compare it with other
836
14:26:40,090 --> 14:26:46,880
Eric Wulff: fields, like image recognition or natural language processing, then this is really a small model.
837
14:26:47,030 --> 14:26:48,389
Eric Wulff: Um!
838
14:26:48,570 --> 14:26:50,480
Eric Wulff: However, we think that,
839
14:26:50,580 --> 14:26:52,679
Eric Wulff: I I actually don't know the
840
14:26:53,190 --> 14:26:57,809
Eric Wulff: the memory requirements that we would have to
841
14:26:57,850 --> 14:27:05,289
Eric Wulff: adhere to, if this would go into production at some point in the future. But I don't think we could go much larger,
842
14:27:05,410 --> 14:27:19,759
Eric Wulff: at least not without doing some kind of quantization, quantization-aware training or post-training quantization, or perhaps pruning weights after training, or doing some other tricks like that.
843
14:27:19,990 --> 14:27:23,109
Eric Wulff: uh data set size. So the
844
14:27:23,680 --> 14:27:26,389
Eric Wulff: the one we are currently using.
845
14:27:30,540 --> 14:27:34,559
Eric Wulff: I think it's around four hundred thousand events
846
14:27:35,000 --> 14:27:38,260
Eric Wulff: collision events of different kinds.
847
14:27:40,140 --> 14:27:44,790
Shigeki: Do you have an approximate idea of how much actual gigabytes that is?
848
14:27:45,140 --> 14:27:46,559
Eric Wulff: Um
849
14:27:47,210 --> 14:27:48,730
Shigeki: auto-
850
14:27:49,250 --> 14:27:51,920
Eric Wulff: it's a few hundred gigabytes,
851
14:27:52,100 --> 14:27:54,480
Eric Wulff: less than a thousand,
852
14:27:55,530 --> 14:28:08,920
Shigeki: And presumably, when you're running this, it's compute-bound, not I/O-bound, in terms of feeding the training data,
853
14:28:08,950 --> 14:28:11,229
Shigeki: or it depends.
854
14:28:11,450 --> 14:28:18,439
Eric Wulff: No, I would say it's compute-bound. Looking at the GPU utilization, it goes
855
14:28:18,590 --> 14:28:20,070
Eric Wulff: close to one hundred percent.
856
14:28:20,139 --> 14:28:22,229
Shigeki: Mhm Okay, thanks.
857
14:28:22,559 --> 14:28:27,009
Enrico Fermi Institute: And do you know how much of the memory on the GPU you're using?
858
14:28:27,570 --> 14:28:30,279
Eric Wulff: Uh, yes, we uh we,
859
14:28:30,400 --> 14:28:33,209
Eric Wulff: we use all of it, basically.
860
14:28:34,049 --> 14:28:40,529
Enrico Fermi Institute: So then it would not help you to have centers that chop up these big GPUs.
861
14:28:41,969 --> 14:28:45,769
Eric Wulff: I don't think so. There is a problem,
862
14:28:45,930 --> 14:28:57,160
Eric Wulff: sometimes, with having too large batch sizes. Basically, in order to fill up the GPU, you increase the batch size that you use for training,
863
14:28:57,230 --> 14:28:58,449
Eric Wulff: Um,
864
14:28:59,530 --> 14:29:05,829
Eric Wulff: and that means you can push more data
865
14:29:05,850 --> 14:29:14,719
Eric Wulff: through per time unit, but it doesn't necessarily mean you can do more optimization steps. So you might not
866
14:29:14,879 --> 14:29:17,020
Eric Wulff: uh reach
867
14:29:17,160 --> 14:29:20,090
Eric Wulff: the same accuracy quicker.
868
14:29:26,629 --> 14:29:38,190
Eric Wulff: It's not obvious that it's always the case that you can just throw more memory at it and it helps. Yeah, I was actually thinking of it the other way:
869
14:29:38,990 --> 14:29:45,470
Enrico Fermi Institute: we have a question in our data center of how much we should chop up the A100s using MIG,
870
14:29:47,480 --> 14:29:50,440
Enrico Fermi Institute: you know, give a person a whole
871
14:29:51,010 --> 14:29:54,830
Enrico Fermi Institute: eighty gigs, or split it up two ways or four ways
872
14:29:55,139 --> 14:30:03,550
Eric Wulff: to serve several users at the same time.
873
14:30:05,549 --> 14:30:06,580
Enrico Fermi Institute: Thanks.
874
14:30:07,530 --> 14:30:09,519
Enrico Fermi Institute: Show another comment:
875
14:30:12,860 --> 14:30:17,950
Enrico Fermi Institute: Sorry I got to the
876
14:30:18,650 --> 14:30:27,329
Dirk: Yeah, I had a question, and it's not so much, I mean, Eric, if you know, you can answer, but it's more looking at the broader,
877
14:30:27,559 --> 14:30:38,899
Dirk: the broader impact of that, and the follow-on, because this is interesting. But what's the next step? Have there been any discussions how
878
14:30:38,969 --> 14:30:41,610
Dirk: to integrate this in like?
879
14:30:41,700 --> 14:30:58,269
Dirk: Eventually? You said it's improving particle flow, so eventually it should feed back into how we run the reconstruction, basically. And then the question comes: how would you actually deploy this? How often do you have to run it?
880
14:30:58,540 --> 14:31:19,770
Dirk: How long does it take? And how often do I have to renew it, basically, with new data, to check that the parameters are still okay? And it's not just a question about this specific thing. So these are like the larger questions. Maybe Lindsay, or I don't know if Mike is connected, if there have been any
881
14:31:19,780 --> 14:31:26,789
Dirk: discussions of that already, or if that's still to come after the initial R&D is done.
882
14:31:30,130 --> 14:31:33,150
Eric Wulff: Well, I would say, if uh,
883
14:31:33,470 --> 14:31:36,980
Eric Wulff: if we are able to prove, or
884
14:31:37,030 --> 14:31:38,920
Eric Wulff: somehow show, that
885
14:31:39,020 --> 14:31:43,090
Eric Wulff: this machine learned approach to particle flow works
886
14:31:43,170 --> 14:31:44,490
Eric Wulff: uh
887
14:31:44,880 --> 14:31:52,579
Eric Wulff: as well but more efficiently, or even better than the
888
14:31:52,610 --> 14:31:54,660
Eric Wulff: methods that are used at the moment.
889
14:31:55,670 --> 14:31:59,449
Eric Wulff: Um, then we sort of freeze that model and
890
14:31:59,690 --> 14:32:04,779
Eric Wulff: get it into production, and then we shouldn't need to redo any hyper-
891
14:32:04,820 --> 14:32:34,339
Dirk: parameter optimization or anything like that. Then, you know, it's like having a finished algorithm that just... Yeah, but during data taking the detector changes all the time. So who knows if the training you did on 2022 data, or even Run 2 data, is still valid for your next set of data. Right. So we're not training on data, we're training on simulation. Okay, right. But I think when we talk about these kinds of problems, one of the things that needs to be studied
892
14:32:34,580 --> 14:32:44,590
Ian Fisk: is how stable these are, because it could be that we're incredibly lucky, and once you do the hyperparameter optimization it's applicable to
893
14:32:45,180 --> 14:32:51,009
Ian Fisk: small changes in data. Um, and one thing I think we can see from Eric's plots is that
894
14:32:51,050 --> 14:33:01,189
Ian Fisk: it makes these things faster. They train faster and better after they're optimized. And so, unless we were just unreasonably lucky, they'll actually save us resources.
895
14:33:02,360 --> 14:33:03,300
Okay,
896
14:33:03,500 --> 14:33:08,860
Dirk: Okay. But it sounds like It's a discussion that's still to come. That's not. We're not quite there yet.
897
14:33:09,400 --> 14:33:25,109
Ian Fisk: Well, I think so. I think, given how much this improves the situation, chances are (and I think this applies to multiple science fields, not just ourselves) that we should be factoring these things into our discussion about how we're going to use HPC
898
14:33:25,140 --> 14:33:35,829
Ian Fisk: um, for the report. And then we'll have to wait and see whether this is a workflow that we're constantly running, or one that we run once in a while.
899
14:33:39,190 --> 14:33:47,179
Mike Hildreth: Yeah, I guess I would agree with that. Um, we don't have enough data yet to say
900
14:33:47,840 --> 14:33:53,670
Mike Hildreth: how often we're going to have to retrain these. But this use case is certainly in the planning.
901
14:33:54,080 --> 14:33:55,760
Enrico Fermi Institute: Is it right?
902
14:33:55,850 --> 14:34:07,809
Enrico Fermi Institute: I think the one remaining worry is, we haven't been through a complete recalibration cycle of the detector after a stop, or anything like that,
903
14:34:07,820 --> 14:34:21,400
Enrico Fermi Institute: to see how robust a single training is, or the most optimal training is, with respect to the changing parameters of the detector. It's just something we have to find out, but it's not going to change the picture all that much, to be honest.
904
14:34:21,410 --> 14:34:28,360
Enrico Fermi Institute: But yeah, I agree with Ian here. This is probably going to save us resources as well in the long run.
905
14:34:28,620 --> 14:34:30,320
Dirk: Okay, thanks.
906
14:34:30,510 --> 14:34:38,550
Dirk: That makes it difficult for us to write because we can write the use case in, but it's extremely hard to attach any numbers to it at the moment.
907
14:34:41,470 --> 14:34:55,099
Enrico Fermi Institute: Yeah, I guess another way to summarize it: we've shown that this works, and that we can get really great results out of it, but we haven't understood the true steady-state operational parameters of this system.
908
14:34:59,230 --> 14:35:04,370
Eric Wulff: And just to be clear, there still needs to be
909
14:35:04,610 --> 14:35:08,699
Eric Wulff: quite a bit of work before this would be ready to go into production.
910
14:35:09,140 --> 14:35:10,600
Eric Wulff: It's still
911
14:35:10,880 --> 14:35:14,050
Eric Wulff: uh, like we don't understand
912
14:35:14,200 --> 14:35:18,509
Eric Wulff: all the properties of how it reconstructs particles well enough. Yet,
913
14:35:20,650 --> 14:35:23,980
Eric Wulff: although you know it's looking good, it's. It's looking promising,
914
14:35:24,230 --> 14:35:30,350
Eric Wulff: but it needs to be validated much more before production.
915
14:35:41,060 --> 14:35:44,129
Enrico Fermi Institute: So, do we have more questions?
916
14:35:46,660 --> 14:35:50,649
Enrico Fermi Institute: I guess one silly question
917
14:35:51,140 --> 14:36:03,900
Enrico Fermi Institute: in terms of actually trying to use this, like in CMSSW, and this is mostly because I don't remember the last time that Joseph presented this: how fast does this go per event in inference mode?
918
14:36:04,220 --> 14:36:06,810
Enrico Fermi Institute: I mean, what does the throughput look like?
919
14:36:06,940 --> 14:36:24,380
Eric Wulff: Um, I don't think we have done anything there that would be comparable to production. Or maybe an even better question is, what does the memory footprint look like on GPU or CPU?
920
14:36:24,770 --> 14:36:31,000
Eric Wulff: uh, I don't know that off the top of my head, but I know we have a plot somewhere that I can
921
14:36:31,100 --> 14:36:32,899
Enrico Fermi Institute: all good. Thank you.
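The per-event throughput being asked about can be estimated with a simple timing loop. This is a hypothetical sketch, not the actual MLPF benchmark; `run_inference` is a placeholder for the real model's forward pass, and the dummy event data stands in for real inputs:

```python
# Hypothetical harness for estimating per-event inference throughput.
# `run_inference` is a stand-in for the real (unspecified) model call.
import time

def run_inference(event):
    # placeholder: the real call would run the trained model on one event
    return sum(event)

events = [[float(i)] * 100 for i in range(2000)]  # dummy event data

start = time.perf_counter()
for ev in events:
    run_inference(ev)
elapsed = time.perf_counter() - start

ms_per_event = 1e3 * elapsed / len(events)
events_per_s = len(events) / elapsed
print(f"{ms_per_event:.4f} ms/event ({events_per_s:.0f} events/s)")
```

The memory footprint mentioned next would need a separate measurement, for example with a memory profiler, rather than a timing loop like this one.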
922
14:36:37,540 --> 14:36:46,069
Enrico Fermi Institute: Okay, if there are no other questions, then we can move on. Thank you very much for the presentation, Eric.
923
14:36:46,360 --> 14:36:48,119
Eric Wulff: No problem. Thanks for listening.