Morning session [Eastern Time]

 

[Kenyi Paolo Hurtado Anampa] 11:05:44
Okay, so good morning, everyone. Today we will talk about clouds.

[Kenyi Paolo Hurtado Anampa] 11:05:51
Okay, two different things. One is resources; this is going to be all of the morning session. Then in the afternoon we are going to talk mostly about networking, the overlap between HPC and clouds, and then R&D. So for this focus area, we

[Kenyi Paolo Hurtado Anampa] 11:06:12
will start by just summarizing, at a very high level,

[Kenyi Paolo Hurtado Anampa] 11:06:17
what ATLAS and CMS have done. In the case of ATLAS:

[Kenyi Paolo Hurtado Anampa] 11:06:24
well, in this case they have a self-contained, cloud-native site; it's linked into the ATLAS grid infrastructure, and they have their own squid and CVMFS.

[Kenyi Paolo Hurtado Anampa] 11:06:42
This will be covered in more detail next.

[Kenyi Paolo Hurtado Anampa] 11:06:47
And then for CMS: this is basically describing what was done about five or six years ago,

[Kenyi Paolo Hurtado Anampa] 11:06:55
during the demos and tests that CMS did with production workflows. The way this was done was by extending an existing CMS site, more particularly Fermilab, with resources in the cloud, and this was done via

[Kenyi Paolo Hurtado Anampa] 11:07:14
HEPCloud. Again, this will be described in more detail in the next few slides. Since this was done this way, in terms of production integration we have the same reservations as HPCs in terms of storage on the worker machines, which means that all data must be staged

[Kenyi Paolo Hurtado Anampa] 11:07:36
to existing sites.

[Kenyi Paolo Hurtado Anampa] 11:07:42
Go on with the next slide; this one is for Fernando.

[Fernando Harald Barreiro Megino] 11:07:48
Yeah, sure. So this is the overview of what we are working on in ATLAS.

[Fernando Harald Barreiro Megino] 11:07:54
So we have two main projects. The one on the left is on Amazon, and this comes through California State University, Fresno.

[Fernando Harald Barreiro Megino] 11:08:04
Here we have basically a PanDA queue, a storage element, and also a squid, and those are always the three main cost components that we will come back to later,

[Fernando Harald Barreiro Megino] 11:08:13
and there's also the egress. And then the second part is the project that we have in

[Fernando Harald Barreiro Megino] 11:08:23
Google. It used to be US-ATLAS-centric, but this year, since the middle of July, it became a worldwide ATLAS project,

[Fernando Harald Barreiro Megino] 11:08:35
and so ATLAS as a collaboration is participating in the budget.

[Fernando Harald Barreiro Megino] 11:08:42
And in this project we have a similar setup as in Amazon, with a PanDA queue, a Rucio storage element, and the squid.

[Fernando Harald Barreiro Megino] 11:08:50
But we also work on an analysis facility prototype, with Jupyter and Dask.

[Fernando Harald Barreiro Megino] 11:08:59
The integration of these cloud resources was done by the Rucio team and the PanDA team.

[Fernando Harald Barreiro Megino] 11:09:07
So we take a different approach than CMS: instead of trying to extend an existing site, we just generate a self-contained, cloud-native site. In the case of Rucio and the storage, it works in such a way that you download a key from Amazon or from

[Fernando Harald Barreiro Megino] 11:09:28
Google, and with that key you can sign URLs. With a signed URL you say: you can upload this particular file until an hour from now, or you can download or delete it. This key then needs to be put into Rucio and into FTS so that they can generate

[Fernando Harald Barreiro Megino] 11:09:47
the signed URLs for the downloads or the third-party transfers. The compute part is all based on Kubernetes and native integration.
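As an editor's illustration of the signed-URL mechanism Fernando describes, here is a minimal generic sketch in Python. It is not the real GCS or S3 V4 signing algorithm, and the host name is made up; it only shows the idea that the holder of the service key can mint URLs authorizing one operation on one object until a deadline, which a service like Rucio or FTS can then hand out.

```python
import hashlib
import hmac
from urllib.parse import urlencode

def sign_url(key: bytes, method: str, path: str, expires_at: int) -> str:
    """Mint an expiring URL: the HMAC covers the method, object path,
    and expiry, so the URL authorizes exactly one operation."""
    payload = f"{method}\n{path}\n{expires_at}".encode()
    sig = hmac.new(key, payload, hashlib.sha256).hexdigest()
    query = urlencode({"Expires": expires_at, "Signature": sig})
    return f"https://storage.example.com{path}?{query}"

def verify(key: bytes, method: str, path: str, expires_at: int,
           sig: str, now: int) -> bool:
    """What the storage endpoint would do: reject expired URLs, then
    recompute and compare the signature in constant time."""
    if now > expires_at:
        return False  # URL has expired
    payload = f"{method}\n{path}\n{expires_at}".encode()
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, sig)
```

Changing the method, path, or expiry invalidates the signature, which is what lets the storage service honor the URL without a separate authentication step.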

[Fernando Harald Barreiro Megino] 11:10:00
In particular, there is nothing like a Condor in the setup. And then we have CVMFS installed on the nodes of our Kubernetes cluster; that was one of the things that actually took most of the effort to get to a very reliable

[Fernando Harald Barreiro Megino] 11:10:18
state. And then there's also the squid part: you can either run it as part of the Kubernetes cluster, or, in Google for example, I just run a load-balanced instance group. The other thing that I always use for the compute is the auto-

[Fernando Harald Barreiro Megino] 11:10:41
scaling. So when there are no jobs queued, for example, the PanDA compute part

[Fernando Harald Barreiro Megino] 11:10:48
shrinks to a minimum, and then, if you submit a lot of jobs, the cluster grows up to the limit, or as much as needed for hosting all of the jobs. The setup is not bound to any particular cloud provider;

[Fernando Harald Barreiro Megino] 11:11:07
it's just standard protocols and technology.

[Fernando Harald Barreiro Megino] 11:11:10
So you can in principle use the same setup in other cloud.

[Fernando Harald Barreiro Megino] 11:11:13
providers. For example, I tried out the PanDA part once in Oracle Cloud, just to see that it works.
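The autoscaling behaviour Fernando describes (shrink to a minimum when idle, grow to a cap under load) reduces to a simple target-size rule. This is a hypothetical sketch: the node size, floor, and cap are invented, and a real Kubernetes cluster autoscaler works from pending pods rather than queued cores.

```python
import math

def target_nodes(queued_cores: int, cores_per_node: int,
                 min_nodes: int = 1, max_nodes: int = 100) -> int:
    """Autoscaling decision sketch: provision enough nodes for the
    queued cores, clamped between a floor and the cluster cap."""
    needed = math.ceil(queued_cores / cores_per_node)
    return max(min_nodes, min(max_nodes, needed))
```

With nothing queued the cluster sits at the floor; a large submission drives it to the cap, which matches the grow-and-shrink pattern described in the talk.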

[Fernando Harald Barreiro Megino] 11:11:22
Yeah, then the next slide, please.

[Fernando Harald Barreiro Megino] 11:11:27
So one of the things that you can exploit on all of these commercial clouds is all the different types of architectures that they have, and that you don't always have on grid sites.

[Fernando Harald Barreiro Megino] 11:11:41
One particular example is on Amazon, where we were doing some ARM testing.

[Fernando Harald Barreiro Megino] 11:11:47
In this case it was Johannes and his team that were trying to build the Athena simulation software for arm64.

[Fernando Harald Barreiro Megino] 11:11:58
They had done the build, and they wanted to do a small physics

[Fernando Harald Barreiro Megino] 11:12:01
validation, running a whole task with that.

[Fernando Harald Barreiro Megino] 11:12:04
But there was not really any volunteer, any available grid site with ARM resources, that could set that up.

[Fernando Harald Barreiro Megino] 11:12:12
So what we did is we set it up in Amazon on Graviton2 nodes, as in the diagrams on the right side.

[Fernando Harald Barreiro Megino] 11:12:25
The first validation that Johannes did was with 10,000 events.

[Fernando Harald Barreiro Megino] 11:12:29
And he compared the x86, which had been executed at TRIUMF,

[Fernando Harald Barreiro Megino] 11:12:32
I believe, against the arm64 on Amazon.

[Fernando Harald Barreiro Megino] 11:12:36
It was matching quite well. Then, some weeks later, we prepared the full physics validation with 1 million events, and that was fully signed off a few weeks ago.

[Fernando Harald Barreiro Megino] 11:12:49
So in principle, simulation could be executed

[Fernando Harald Barreiro Megino] 11:12:57
like in standard production now. And I mean, we don't do this in particular for the cloud; we do it more because, as was discussed yesterday in the HPC

[Fernando Harald Barreiro Megino] 11:13:06
session, most of the next-generation HPCs are going to come with more ARM CPUs, and x

[Fernando Harald Barreiro Megino] 11:13:17
86 is not going to be as dominant, so it's a preparation for that.

[Fernando Harald Barreiro Megino] 11:13:20
Other things: other exotic architectures or resources

[Fernando Harald Barreiro Megino] 11:13:27
that can be used in the cloud. For example, there is a user that is doing some trigger studies for a filter, and there he's using FPGAs on Amazon. Or Johannes, for building the software, uses very large nodes on Amazon and

[Fernando Harald Barreiro Megino] 11:13:47
Google, and also GPU stuff. Next slide, please. And if anyone has a question or comment while I'm going through the slides, you can interrupt me.

[Fernando Harald Barreiro Megino] 11:14:01
Now we come to Google, just running Google as a grid site. You can see two different approaches: on the right top plot,

[Fernando Harald Barreiro Megino] 11:14:14
you can see how we were doing scale tests.

[Fernando Harald Barreiro Megino] 11:14:17
That was done during the previous funding round, where we were trying to see how far we can scale in a single cloud region, and we were getting to 100,000 cores in europe-west1,

[Fernando Harald Barreiro Megino] 11:14:35
which is one of the European regions. If you wanted to scale this out even more, you could replicate the setup to the US, to multiple regions in Europe, and so on, reaching a very high number of cores. What we are doing now,

[Fernando Harald Barreiro Megino] 11:14:56
since it's a fully worldwide ATLAS project, is running a fixed-size grid site at the moment.

[Fernando Harald Barreiro Megino] 11:15:05
We started with 5,000 cores, and we moved it to 10,000 cores

[Fernando Harald Barreiro Megino] 11:15:10
exactly a month ago, and we can run any type of production.

[Fernando Harald Barreiro Megino] 11:15:16
We are not running analysis at the moment, because we need to reorganize the storage.

[Fernando Harald Barreiro Megino] 11:15:23
In particular, we need a DATADISK and a separate SCRATCHDISK, so that user outputs don't end up in the same storage element.

[Fernando Harald Barreiro Megino] 11:15:32
Other than that, this grid site has worked very well.

[Fernando Harald Barreiro Megino] 11:15:38
It's very reliable, with a very low error rate,

[Fernando Harald Barreiro Megino] 11:15:41
and the errors are usually very focused on particular situations: for example, I migrated to machines with low disk, or at one time there were issues with some tasks, and I had to fix that. Our goal is to do a mix of

[Fernando Harald Barreiro Megino] 11:16:02
both versions: mix the on-demand fast scale-out with a fixed size.

[Fernando Harald Barreiro Megino] 11:16:12
So we plan to run more or less a flat queue with 5,000 cores, and then on top run a dynamic queue which processes urgent requests.

[Fernando Harald Barreiro Megino] 11:16:24
And we are going to do something that we call the full chain, where all of the steps in our production chain run inside the same resource, and you only export the final output, in order to reduce the egress cost.

[Fernando Harald Barreiro Megino] 11:16:48
Yeah, the next slide; thanks, Kenyi. The other thing that we tried out is this analysis facility prototype.

[Fernando Harald Barreiro Megino] 11:16:56
What we wanted to do is Dask scaling evaluations.

[Fernando Harald Barreiro Megino] 11:17:03
So we installed Jupyter and Dask on Google.

[Fernando Harald Barreiro Megino] 11:17:08
We integrated it with the ATLAS IAM, so anyone from ATLAS can connect without needing to request any particular new account or anything.

[Fernando Harald Barreiro Megino] 11:17:20
Then we have a couple of different options that the user can select.

[Fernando Harald Barreiro Megino] 11:17:27
There is a first lightweight version, but then we also have machine-learning images,

[Fernando Harald Barreiro Megino] 11:17:31
so that people can use TensorFlow and all those libraries. You can also, if you want, get a notebook with a GPU, and that will take a little moment, because you need to provision the machine.

[Fernando Harald Barreiro Megino] 11:17:50
It needs to install and mount CVMFS, and then be added to the cluster.

[Fernando Harald Barreiro Megino] 11:17:57
That takes a couple of minutes, but then you have a notebook with a GPU just for yourself, and you can work as long as you need. For the Dask part, which is in my opinion a very good example for cloud scalability, the lower right plot was

[Fernando Harald Barreiro Megino] 11:18:20
from a user who was trying out running the same task, but with a different number of workers.

[Fernando Harald Barreiro Megino] 11:18:27
So he ran first with 100 workers, and it took 40 min.

[Fernando Harald Barreiro Megino] 11:18:30
Then he reran the same task with 200 workers, and the duration was halved, and so on until the last part, where he uses 1,500 workers and the task is done within just a few minutes. The thing about this is that the cost on the cloud is

[Fernando Harald Barreiro Megino] 11:18:51
roughly the same, except for maybe the scaling or scheduling overhead.

[Fernando Harald Barreiro Megino] 11:18:57
The cost is roughly the same whether you run with very few workers or with a lot of workers, and for the user himself it makes a lot of difference whether he gets the results in one hour or in five minutes. And yeah, we should also consider in the cost

[Fernando Harald Barreiro Megino] 11:19:19
calculation the salary of the user himself, since he's optimizing his time a lot.
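The scaling behaviour Fernando describes (halve the wall time by doubling the workers, at roughly constant cost) can be written down as a toy model; the price here is a placeholder, and the model deliberately ignores the scheduling overhead he mentions.

```python
def run_stats(total_core_hours: float, workers: int,
              price_per_core_hour: float) -> tuple[float, float]:
    """Toy model of the Dask scale-out trade-off: wall time shrinks
    with the worker count, while cost = workers * duration * rate
    stays flat for a fixed amount of work."""
    duration_h = total_core_hours / workers
    cost = workers * duration_h * price_per_core_hour
    return duration_h, cost
```

Doubling from 100 to 200 workers halves the duration at identical cost, which is exactly the pattern in the plot being described.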

[Fernando Harald Barreiro Megino] 11:19:27
Yes, and that's it. Kenyi, next slide.

[Kenyi Paolo Hurtado Anampa] 11:19:32
Thanks, Fernando. Yes, so then for CMS again:

[Kenyi Paolo Hurtado Anampa] 11:19:37
this is what was done a few years ago. Again, as I mentioned before, we did this by integrating cloud resources into one of the existing sites, the Fermilab site,

[Kenyi Paolo Hurtado Anampa] 11:19:51
via HEPCloud. So you basically have a workflow injected,

[Kenyi Paolo Hurtado Anampa] 11:19:55
which is the resource provisioning trigger.

[Kenyi Paolo Hurtado Anampa] 11:19:58
This enters the facility interface which talks to the authentication and authorization mechanisms.

[Kenyi Paolo Hurtado Anampa] 11:20:04
Then there is a decision engine and a facility pool

[Kenyi Paolo Hurtado Anampa] 11:20:09
there, and the decision engine basically talks to a provisioner that will be talking to the cloud

[Kenyi Paolo Hurtado Anampa] 11:20:16
provider. So this is basically a diagram of the HEPCloud architecture.

[Kenyi Paolo Hurtado Anampa] 11:20:22
What you have from there is basically glideins starting on the resources in the cloud.

[Kenyi Paolo Hurtado Anampa] 11:20:34
So you have them connecting to the HTCondor schedulers and the glideinWMS infrastructure, and that's how everything

[Kenyi Paolo Hurtado Anampa] 11:20:43
is connected in this case.

[Kenyi Paolo Hurtado Anampa] 11:20:53
Okay. And the next part is Lancium. Dirk, I think, is talking.

[Dirk] 11:21:01
Yes, so Lancium was already mentioned yesterday. It's an interesting new

[Dirk] 11:21:11
company. They're not like your traditional full-service cloud provider that basically operates worldwide and gives you anything you want in terms of capabilities and instance types and whatever. They're really geared towards utilizing low-cost renewable energy

[Dirk] 11:21:35
to provide cheap compute. Part of the business model is almost like an energy utility:

[Dirk] 11:21:42
basically, they get money for being able to load-shed. And they're

[Dirk] 11:21:47
constructing data centers right now in areas with very high renewable wind energy. We did a test a few months back where we integrated them into production;

[Dirk] 11:21:59
we ran a few small workflows. It was all on free cycles, as a test.

[Dirk] 11:22:05
Basically, they're a bit different from AWS and Google: they only support Singularity containers, not VMs.

[Dirk] 11:22:11
What we did is just run a pilot job in the Singularity container, and the pilot itself is just the standard CMS pilot,

[Dirk] 11:22:21
so it runs our payloads in a nested Singularity container.

[Dirk] 11:22:26
CVMFS and a local squid were provided by Lancium;

[Dirk] 11:22:30
we worked with them on that. They currently don't have any local managed storage,

[Dirk] 11:22:34
just job scratch. So we basically run these resources like we do opportunistic OSG

[Dirk] 11:22:39
or HPC sites, where we don't use managed storage:

[Dirk] 11:22:42
we just use AAA reads to get the input and then stage out to Fermilab.

[Dirk] 11:22:48
So that covers the runtime.

[Dirk] 11:22:50
The provisioning integration is another potentially problematic area for the long term, because they have a custom API which is not compatible with AWS or Google. They're running Singularity containers, so you need some way to start up

[Dirk] 11:23:05
a container. What we're doing right now is just vacuum provisioning:

[Dirk] 11:23:10
when we want to run a test, we just start up a container manually as needed.

[Dirk] 11:23:15
That's relatively simple through the API, because the API is just:

[Dirk] 11:23:19
you can run some script that calls out to the Python API,

[Dirk] 11:23:25
and ask it how many containers are running.

[Dirk] 11:23:28
If it's less than 10, you bring it up to 10. So that's basically the level of integration that we have right now.
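The "bring it up to 10" rule Dirk describes can be sketched as below. The API client here is a stand-in: the real Lancium API, its token handling, and its method names are not shown in the talk, so everything about the class is hypothetical.

```python
class FakeLanciumAPI:
    """Stand-in for the provider API (hypothetical names): a real
    client would authenticate with an account token and talk to the
    vendor's service."""
    def __init__(self, running: int):
        self.running = running
    def count_running(self) -> int:
        return self.running
    def start_container(self, image: str) -> None:
        self.running += 1

def top_up(api, image: str, floor: int = 10) -> int:
    """Vacuum provisioning: if fewer than `floor` pilot containers are
    running, start enough to reach `floor`. Returns how many started."""
    started = 0
    while api.count_running() < floor:
        api.start_container(image)
        started += 1
    return started
```

Run periodically (say, from cron), this keeps a constant-size pool of pilots without any deeper integration into the experiment's provisioning system.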

[Dirk] 11:23:36
So we think it's interesting enough. It will require a little bit of work to get it fully integrated,

[Dirk] 11:23:44
but we're working with Lancium on procuring a small number of cycles for more tests.

[Dirk] 11:23:51
The plan is maybe to get some cycles there and then see if we can,

[Dirk] 11:23:57
when there's particular load from CMS specifically on Fermilab, say: okay, we bring up Lancium resources, and that frees up resources at Fermilab to do the stuff that is most suited to a Tier

[Dirk] 11:24:10
One

[Enrico Fermi Institute] 11:24:28
Just to get in there for a second: all of this is very much oriented around production jobs. Do you think we could organize, sometime in the next year or something like that, trying to interface this with either Coffea-Casa or the elastic analysis facility

[Enrico Fermi Institute] 11:24:45
effort, to see if we can gain more flexibility for more bursty analysis jobs, much like what ATLAS was doing with Google Cloud and whatnot?

[Dirk] 11:24:56
We could try. I mean, the

[Enrico Fermi Institute] 11:24:58
The security is gonna be a nightmare at the elastic analysis

[Dirk] 11:25:03
Yeah, the thing is, it really depends how well everything plays together with the provisioning integration.

[Enrico Fermi Institute] 11:25:10
facility, for sure.

[Dirk] 11:25:10
I mean, they have a simple API.

[Enrico Fermi Institute] 11:25:10
Yeah.

[Dirk] 11:25:13
You basically need a token associated with your account, and then you have a single Python script, like a monolithic Python script that they give you, where you can tell it to start a container and bring up something. So it's

[Enrico Fermi Institute] 11:25:25
Okay.

[Dirk] 11:25:27
relatively simple. So sure, I mean, we can look at it. It's a matter of: do you want to do it?

[Enrico Fermi Institute] 11:25:32
I mean

[Dirk] 11:25:35
Do you want to do tests, or you want to do it for real?

[Dirk] 11:25:37
Because if you do it for real, then you actually need to have paid for a number of cycles sitting there; for tests, we can just go whenever.

[Enrico Fermi Institute] 11:25:46
Yeah, I think we would need to get the facilities, or at least the one at Fermilab, as it is right now, and then go for a more for-real test with people's actual analysis

[Enrico Fermi Institute] 11:25:57
jobs once we have that set up. I think that would be the better way to see how this actually works.

[Enrico Fermi Institute] 11:26:06
So this is like a year timescale, or something like that.

[Enrico Fermi Institute] 11:26:09
Your analysis facility: do you have any implicit dependencies on shared file systems, or anything like that? Because we're at Fermilab, we're restricted from using shared file systems, aside from, like, XRootD and stuff. Okay, yeah, I was gonna

[Enrico Fermi Institute] 11:26:24
say that might be one challenge in stretching out, right: how do you stretch a file system out there?

[Dirk] 11:26:29
They.

[Enrico Fermi Institute] 11:26:29
Exactly, but thankfully we've already been forced to solve that.

[Dirk] 11:26:33
Maybe, because Lindsay just mentioned a year, on the time horizon of that: currently Lancium,

[Dirk] 11:26:41
as I said, is a young company just starting up. They're kinda still building the data centers. So they have a test data center

[Enrico Fermi Institute] 11:26:45
Hmm.

[Dirk] 11:26:51
that's in Houston, which is not really using renewable energy,

[Dirk] 11:26:54
but where they're basically just deploying the whole hardware/software integration that they're working with, and that's what we've been testing. What they're building right now, which is supposed to come online at some point later this year or early next year, are really the big data centers

[Dirk] 11:27:08
which are co-located with wind-energy hotspots in Texas.

[Dirk] 11:27:15
There's not much else there, but they're building a data center, and those will be the interesting

[Dirk] 11:27:20
ones, basically, because that's real renewable energy.

[Dirk] 11:27:22
There's lots and lots of power capacity there. And, more importantly, they're gonna connect them at 100 gigabit to

[Dirk] 11:27:33
ESnet and everything else.

[Enrico Fermi Institute] 11:27:34
Okay, they're actually going to peer? They're actually going to connect, peer with ESnet, for sure?

[Dirk] 11:27:41
That's what their plan is, because they are basically making a hard sales pitch to academic users.

[Dirk] 11:27:51
I mean, I've seen talks from them at SC and at OSG;

[Dirk] 11:27:56
they basically travel around Europe, because for Europe, running compute on cheap power is an even bigger concern right now than in the US,

[Dirk] 11:28:06
because power prices there have traditionally been much higher,

[Dirk] 11:28:09
and are now extremely much higher than in the US.

[Enrico Fermi Institute] 11:28:12
But are they gonna connect to ESnet versus Internet2?

[Dirk] 11:28:17
Probably. I mean, they

[Dirk] 11:28:20
basically point out that that's another of their selling points:

[Dirk] 11:28:24
they want to not charge for egress.

[Dirk] 11:28:32
So, yeah, not charging for egress, and good network

[Dirk] 11:28:39
integration for academic workloads, seems to be what they're focusing on. I mean, you have to look at it: they offer a low quality of service,

[Dirk] 11:28:50
somewhat by design. They're not like Amazon, where they sell you a VM

[Dirk] 11:28:55
and promise you 99-point-whatever availability. Lancium tells you:

[Dirk] 11:28:59
if there's no wind, we're gonna load-shed like crazy,

[Dirk] 11:29:04
so we're gonna evacuate you. And that's fine,

[Dirk] 11:29:08
but that also means that they have to have other selling points and other target markets, because they're not gonna attract the financial sector, or industry

[Dirk] 11:29:18
that wants a high-uptime compute service sitting somewhere.

[Ian Fisk] 11:29:22
But the key thing was: it's ESnet, not Internet2, right?

[Dirk] 11:29:26
I'm not exactly sure.

[Ian Fisk] 11:29:28
Oh sure!

[Dirk] 11:29:30
And you have to remember, the discussions we had with them

[Ian Fisk] 11:29:32
Right.

[Dirk] 11:29:34
were months before; these data centers are still under construction, so I don't think the network is connected yet.

[Ian Fisk] 11:29:38
Right. The only reason I ask is that the ESnet charter is relatively strict: it will allow you to connect Fermilab or BNL or CERN to Lancium resources, but, for instance, it won't carry traffic from a university. So one of the

[Dirk] 11:29:39
Nope.

[Ian Fisk] 11:29:58
endpoints needs to be under the ESnet charter, which can be a little limiting.

[Dirk] 11:30:04
I mean, at the moment I would read it as a statement of intent: they want to do everything they can on the network-integration side to make it easy for us to use their facilities. What's actually deployed and how things are connected, I think we have to wait for these data centers.

[Ian Fisk] 11:30:08
Okay.

[Ian Fisk] 11:30:15
Right.

[Ian Fisk] 11:30:22
Right, and it's fine if that matches under the charter.

[Kenyi Paolo Hurtado Anampa] 11:30:50
Yes. So the next section in the slides is about cloud costs, and just like yesterday, the green

[Kenyi Paolo Hurtado Anampa] 11:31:04
here means this is one of the questions from the charge that we need to answer.

[Kenyi Paolo Hurtado Anampa] 11:31:08
It's basically: what is the total cost of operating commercial cloud resources for collaboration workflows?

[Kenyi Paolo Hurtado Anampa] 11:31:15
This is mostly focused on production workflows for the charge, both the compute resources as well as the operational effort, for LHC

[Kenyi Paolo Hurtado Anampa] 11:31:24
Run 3. So with that we will start with: what's the experience so far with the cost for ATLAS and CMS? Fernando, do you want to take over from here?

[Fernando Harald Barreiro Megino] 11:31:38
Yes. So the content of this slide is mostly my opinion and experience,

[Fernando Harald Barreiro Megino] 11:31:47
and just to note that within this ATLAS Google project there will be a dedicated TCO,

[Fernando Harald Barreiro Megino] 11:31:59
total cost of ownership, board that will then study the costs in detail.

[Fernando Harald Barreiro Megino] 11:32:08
So, to explain a little bit the cost model that you have in the cloud:

[Fernando Harald Barreiro Megino] 11:32:13
for the compute there are different tiers of virtual machines.

[Fernando Harald Barreiro Megino] 11:32:20
You have the reserved instances, where you basically say:

[Fernando Harald Barreiro Megino] 11:32:24
I reserve so many CPUs for a year. But that also means that you are stuck for the full year with those reserved instances, and there is no real elasticity.

[Fernando Harald Barreiro Megino] 11:32:38
Then there is the on-demand tier, where you request a virtual machine when you want it, and once you have it, they won't take it away from you, like, forever. And then the tier below on-demand is spot,

[Fernando Harald Barreiro Megino] 11:32:58
where you request a virtual machine and can get it, but

[Fernando Harald Barreiro Megino] 11:33:06
it can also be taken away from you whenever Google needs it

[Fernando Harald Barreiro Megino] 11:33:09
for someone else, or it can just be evicted and put somewhere else if they need to do some optimization within their computing center.

[Fernando Harald Barreiro Megino] 11:33:24
Yeah, I think it's 30 seconds' notice. So it's not

[Fernando Harald Barreiro Megino] 11:33:31
nothing, but in practice we can work with it.

[Fernando Harald Barreiro Megino] 11:33:34
If you get the kill signal, you lose whatever was running on the VM

[Fernando Harald Barreiro Megino] 11:33:40
until then.

[Fernando Harald Barreiro Megino] 11:33:44
But the experience with spot is quite good, in my opinion. Google in particular used to have only the preemptible VMs, which had a maximum lifetime of 24

[Fernando Harald Barreiro Megino] 11:33:59
hours; now they've stopped that model and moved to spot, and there you can have the virtual machines for a long time,

[Fernando Harald Barreiro Megino] 11:34:07
and I don't see a significant amount of failed or wasted wall-clock time because of spot preemptions.

[Fernando Harald Barreiro Megino] 11:34:16
You will see it later, but it's like 60% cheaper than on-demand.

[Fernando Harald Barreiro Megino] 11:34:22
Then for the storage you also have different classes.

[Fernando Harald Barreiro Megino] 11:34:27
There are Standard, Nearline, Coldline; the time to access your files

[Fernando Harald Barreiro Megino] 11:34:33
is the same with all of them, but the

[Fernando Harald Barreiro Megino] 11:34:38
further you go to the right, the longer you need to keep the data on the storage. So I think for Nearline you need to keep it 30 days; Coldline,

[Fernando Harald Barreiro Megino] 11:34:51
I don't know how many days.

[Fernando Harald Barreiro Megino] 11:34:52
Also, the colder the class, the more you pay for the access.

[Fernando Harald Barreiro Megino] 11:34:58
So in practice, for the storage we are always using the Standard class.
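The storage-class trade-off just described (colder classes are cheaper per GB-month but bill a minimum retention period and charge more for access) can be captured in a toy cost function. The per-GB prices and minimum durations below are illustrative placeholders, not current Google list prices, and access/retrieval charges are left out.

```python
def storage_cost_usd(gb: float, days_stored: int, cls: str) -> float:
    """Toy storage bill: colder classes have lower per-GB-month rates
    but bill at least a minimum number of days even if the data is
    deleted earlier. All numbers are illustrative placeholders."""
    price_per_gb_month = {"standard": 0.020, "nearline": 0.010,
                          "coldline": 0.004}
    min_days = {"standard": 0, "nearline": 30, "coldline": 90}
    billed_days = max(days_stored, min_days[cls])
    return gb * price_per_gb_month[cls] * billed_days / 30
```

For data deleted after only 10 days, the colder classes end up more expensive than Standard because of the minimum-duration billing, which is one reason a site with short-lived job data would stick to Standard.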

[Fernando Harald Barreiro Megino] 11:35:05
So that's the traditional cost model. Now there is a new cost model,

[Fernando Harald Barreiro Megino] 11:35:09
which is what we are using: this US public sector subscription agreement with Google. Basically, if you are a university or a lab from the public sector, you can negotiate a fixed price for your computing needs. So, an agreement

[Fernando Harald Barreiro Megino] 11:35:31
for 10,000 total CPUs and 7 petabytes of storage, for $600,000 for 15 months,

[Fernando Harald Barreiro Megino] 11:35:48
and you pay that amount and don't need to worry

[Fernando Harald Barreiro Megino] 11:35:51
if you then have more egress or less. You should optimize your agreement, obviously,

[Fernando Harald Barreiro Megino] 11:35:58
and this protects you from surprises. At the end of your 15 months,

[Fernando Harald Barreiro Megino] 11:36:04
I guess there will be a review, like whether your egress is out of control or not, or maybe they are happy with the situation, and it gets renegotiated. I don't want to talk about the exact amount of dollars that we have in our agreement,

[Fernando Harald Barreiro Megino] 11:36:24
but I just want to say that it's very favorable, and it's lower than the list prices. The other thing to consider in these clouds is that the resources are very elastic.

[Fernando Harald Barreiro Megino] 11:36:41
It's a bit what I tried to show with the Dask example: the cost

[Fernando Harald Barreiro Megino] 11:36:45
for 10,000 CPUs for one hour is the same as the cost for one CPU

[Fernando Harald Barreiro Megino] 11:36:50
for 10,000 hours, so you can run elastically without a major cost increase. Also, since it's elastic, if you ramp down you don't need to keep any idle resources; you really just pay for what you use. And from an operational perspective,

[Fernando Harald Barreiro Megino] 11:37:17
in my opinion, it's very low cost. I mean, the whole development, setup, and operation of the Rucio part was done by one of the Rucio experts, and all of the development, setup, and operation of the PanDA part by one PanDA expert, at fractions of

[Fernando Harald Barreiro Megino] 11:37:38
their FTE. And I also think that this model is really pure DevOps, the most pure form of it.

[Fernando Harald Barreiro Megino] 11:37:49
You operate a site, and you also learn things that are not good for the site that you can improve in PanDA or in Harvester,

[Fernando Harald Barreiro Megino] 11:37:59
And then you go and change those things. And also, with the same amount of FTE

[Fernando Harald Barreiro Megino] 11:38:08
resources, whether I run a 10,000-core cluster or a 30,000-core cluster doesn't really make a difference.

[Fernando Harald Barreiro Megino] 11:38:18
I'm now moving to the plot on the right. What I'm showing here is that all of the bins, except the last one to the right, are simulations using the cost calculator.

[Fernando Harald Barreiro Megino] 11:38:34
The first ones are on Amazon, the second ones are on Google.

[Fernando Harald Barreiro Megino] 11:38:38
I did an average of the US Tier 2s, so that on average it's 10,000 virtual CPUs,

[Fernando Harald Barreiro Megino] 11:38:48
of course, 7 petabytes of storage,

[Fernando Harald Barreiro Megino] 11:38:55
and then also, on average,

[Fernando Harald Barreiro Megino] 11:39:04
there was 1.5 petabytes of egress per month.

[Fernando Harald Barreiro Megino] 11:39:06
I looked it up in the DDM dashboard, and then I went to the Google price calculator, and, using different types of VMs,

[Fernando Harald Barreiro Megino] 11:39:17
I calculated the cost. The blue part is the CPU,

[Fernando Harald Barreiro Megino] 11:39:21
the red part is the storage, the 7 petabytes, and the yellow part is the 1.5 petabytes of egress per month. And then, depending on the type of compute you use, you can reduce the cost. The first one is on-

[Fernando Harald Barreiro Megino] 11:39:38
demand. The second one is if you pay one year upfront on Amazon. Then, if you reserve for a year on Amazon

[Fernando Harald Barreiro Megino] 11:39:46
but don't pay upfront, it's a little bit more expensive.

[Fernando Harald Barreiro Megino] 11:39:49
Then you reserve for 3 years, and you see that the price starts dropping considerably. The last one for Amazon is Amazon Spot, and you see that the CPU part is really much lower than in the first bin, which is Amazon on-demand.

[Fernando Harald Barreiro Megino] 11:40:08
Then, if we move to the Google part: Google is a little bit cheaper for the CPU, at least for the calculations that I did; the egress and the storage are more or less the same as Amazon. And then, for the very last bin, I took

[Fernando Harald Barreiro Megino] 11:40:30
the billing report of the Google Cloud console for the last 30 days and extracted how much we have been spending on each one of the things, to compare it with my theoretical calculations. The CPU was a little bit cheaper, but

[Fernando Harald Barreiro Megino] 11:40:50
we use Spot, so you have to compare it with the GCP

[Fernando Harald Barreiro Megino] 11:40:53
Spot bin; it's a little bit cheaper. Also, we didn't use the full 10,000 CPUs, but only around 9,200. The storage is much cheaper than the others, but that's also because we don't have the 7 petabytes of data

[Fernando Harald Barreiro Megino] 11:41:12
yet; we have only 1.6 petabytes. So that explains it.

[Fernando Harald Barreiro Megino] 11:41:16
And then egress: we did 1.2 petabytes of egress according to the GCP billing, which is very close to what I had gotten in my model.

[Fernando Harald Barreiro Megino] 11:41:28
So that's what you would be paying if you paid list prices.

[Fernando Harald Barreiro Megino] 11:41:33
But again, with our subscription agreement, what we effectively keep paying is lower than that. And this is it for this slide.
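The cost model behind the plot being discussed (compute plus storage plus egress, summed per month) can be sketched as below. All unit prices are hypothetical placeholders, not Google's or Amazon's actual list prices.

```python
# Rough sketch of the monthly cost model behind the comparison plot:
# compute + object storage + egress. Unit prices are assumed placeholders.
def monthly_cost(vcpus, storage_pb, egress_pb_per_month,
                 usd_per_core_hour=0.02,     # assumed compute price
                 usd_per_pb_month=23_000,    # assumed object-store price
                 usd_per_pb_egress=80_000):  # assumed egress price
    hours_per_month = 730  # average hours in a month
    cpu = vcpus * hours_per_month * usd_per_core_hour
    storage = storage_pb * usd_per_pb_month
    egress = egress_pb_per_month * usd_per_pb_egress
    return {"cpu": cpu, "storage": storage, "egress": egress,
            "total": cpu + storage + egress}


# The scenario discussed: 10,000 vCPUs, 7 PB of storage, 1.5 PB egress/month.
print(monthly_cost(10_000, 7, 1.5))
```

Swapping the compute price (on-demand, reserved, spot) while holding storage and egress fixed reproduces the bin-to-bin differences described above.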

[Fernando Harald Barreiro Megino] 11:41:46
I see there are questions. Yeah.

[Paolo Calafiura (he)] 11:41:52
Quick question; actually, it's a comment followed by a question.

[Paolo Calafiura (he)] 11:41:57
The subscription price is, of course, very advantageous, but it does

[Paolo Calafiura (he)] 11:42:03
kind of remove the elasticity you mentioned, because, you know, if you use one CPU for 10,000 hours, you are not using your subscription very well at all. So that was the only comment I wanted to make. And the follow-up question is: has anyone talked with

[Fernando Harald Barreiro Megino] 11:42:16
Hmm.

[Paolo Calafiura (he)] 11:42:22
Amazon about a model similar to this Google subscription?

[Fernando Harald Barreiro Megino] 11:42:28
So, about elasticity: not completely, because the agreement is for 10,000 virtual CPUs on average. So you could be using 5,000 one month and 15,000 the next month. But yeah,

[Fernando Harald Barreiro Megino] 11:42:46
if you arrive on the last day and want to use your average of 10,000 virtual CPUs times 15 months all on the last day, that will be very difficult.

[Fernando Harald Barreiro Megino] 11:42:57
But if you smooth out your resource usage, there is some elasticity. And about the Amazon question, I don't know.
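The averaged-subscription constraint being described can be sketched as a simple check: the agreement fixes an average level of 10,000 vCPUs over the 15-month term (both numbers taken from the discussion), so monthly usage may fluctuate as long as the running average stays on target.

```python
# Sketch of the averaged-subscription constraint: usage may fluctuate month
# to month as long as the average over the term stays within the target.
TARGET_AVG_VCPUS = 10_000  # averaged commitment level from the discussion
TERM_MONTHS = 15           # term length from the discussion


def within_agreement(monthly_vcpus: list) -> bool:
    """True if average usage over the full term does not exceed the target."""
    return sum(monthly_vcpus) / TERM_MONTHS <= TARGET_AVG_VCPUS


# 5,000 one month, 15,000 the next, and so on, still averages out:
usage = [5_000, 15_000] * 7 + [10_000]
print(within_agreement(usage))  # -> True
```

Deferring all 150,000 vCPU-months to the final month would pass this arithmetic check but, as noted above, would be very difficult in practice.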

[Paolo Calafiura (he)] 11:42:57
yeah, I did.

[Kaushik De] 11:43:19
Yeah, we have not had that conversation with Amazon.

[Kaushik De] 11:43:24
We only use credits in the old traditional way.

[Kaushik De] 11:43:33
So in some sense it's good, because we have a side-by-side comparison with Amazon via the fixed credits.

[Chris Hollowell] 11:43:56
Yes. You know, from my experience, a lot of the cloud providers,

[Chris Hollowell] 11:44:01
They're not really guaranteeing a specific Cpu model.

[Chris Hollowell] 11:44:05
It's sort of nebulous what CPU they provide.

[Chris Hollowell] 11:44:09
So I mean, I guess the question is: you know, say you have 10,000 cores...

[Fernando Harald Barreiro Megino] 11:44:21
They do not tell you exactly what the CPU model is, but some family.

[Fernando Harald Barreiro Megino] 11:44:31
So, for example, I used the N2, and that is Cascade Lake or Ice Lake, I think. I'm not the CPU expert, but those are, for Google, the newer generations.

[Fernando Harald Barreiro Megino] 11:44:44
And if you take the N1, you get the older generations.

[Fernando Harald Barreiro Megino] 11:44:46
So, yeah, you're more or less right.

[Fernando Harald Barreiro Megino] 11:44:55
You don't know exactly what the CPU is; that's an approximation.
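Which model within the family a VM actually landed on can usually be checked from inside the guest. A minimal sketch, assuming a Linux VM where `/proc/cpuinfo` is available:

```python
# Check which CPU model a cloud VM actually provides: on Linux the model
# name is exposed in /proc/cpuinfo even when the provider only guarantees
# a family (e.g. GCP N2 = Cascade Lake or Ice Lake).
import pathlib


def cpu_model(cpuinfo_text: str) -> str:
    """Extract the first 'model name' entry from /proc/cpuinfo contents."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("model name"):
            return line.split(":", 1)[1].strip()
    return "unknown"


info = pathlib.Path("/proc/cpuinfo")
if info.exists():  # only present on Linux guests
    print(cpu_model(info.read_text()))
```

Note that some providers mask the exact SKU, so the reported string may itself be generic (e.g. "Intel(R) Xeon(R) CPU @ 2.20GHz").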

[Steven Timm] 11:44:59
Oh, you think hmm

[Enrico Fermi Institute] 11:45:00
Do they not expose anything in the OS?

[Enrico Fermi Institute] 11:45:05
Okay.

[Steven Timm] 11:45:06
So, yeah, I had my student actually run benchmarks on most of the new Google instances this summer.

[Steven Timm] 11:45:18
I have the numbers; we've got most of the Google instance types available.

[Steven Timm] 11:45:22
I think we want some

[Fernando Harald Barreiro Megino] 11:45:24
I would be interested in having that

[Chris Hollowell] 11:45:26
right, right.

[Steven Timm] 11:45:37
okay.

[Chris Hollowell] 11:45:39
I guess the issue there, though, is that since they're not guaranteeing any CPU model

[Chris Hollowell] 11:45:44
in particular, that could change.

[Enrico Fermi Institute] 11:45:57
Here we had a comment from Dirk

[Dirk] 11:46:04
One was about the elasticity, which was already covered; so it seems to be possible, within limits.

[Dirk] 11:46:12
But probably, if you have a 10,000 average, you can't run 120,000 for one month and nothing the rest of the year.

[Enrico Fermi Institute] 11:46:16
Thanks.

[Dirk] 11:46:18
That's probably not gonna fly. Let's see.

[Dirk] 11:46:22
Seems to me. But the other one was, I mean, we talked about these pricing plots a bit.

[Dirk] 11:46:28
I think I finally understood what that last bin means.

[Dirk] 11:46:35
So that's from within the subscription; that's from the running counter inside.

[Dirk] 11:46:40
So it's in some sense fake pricing, right?

[Dirk] 11:46:42
Because you pay the subscription price, but they still tabulate what things cost?

[Fernando Harald Barreiro Megino] 11:46:48
Yeah. So with the subscription, what they are doing is all the time filling up our credit.

[Dirk] 11:47:01
Quote unquote.

[Dirk] 11:47:01
Okay.

[Ian Fisk] 11:47:11
Yup, mine was a question, actually. It is interesting to see the various models between the two big cloud providers.

[Ian Fisk] 11:47:19
Has anyone done an updated estimate of what it's actually costing us to host these things ourselves?

[Ian Fisk] 11:47:24
Because I'm looking at these numbers, and I know the size of the facility that we run versus the cost of the hosting and the operations, and these numbers are dramatically higher than what we're paying.

[Ian Fisk] 11:47:44
So am I reading that right?

[Ian Fisk] 11:47:49
Okay, I am including in that price the cost of the hosting.

[Ian Fisk] 11:47:53
So what it costs to rent the space, to power the machines, to buy the machines, to operate the machines, to administer the machines, and to support people using the machines.

[Fernando Harald Barreiro Megino] 11:48:07
But then everything that is installed on top is not included in this?

[Ian Fisk] 11:48:11
Oh, really? What do you mean, what's installed on top?

[Fernando Harald Barreiro Megino] 11:48:16
All the services that you are running

[Ian Fisk] 11:48:17
I am including all of that. I mean, services like the batch system; I'm including all of those things too.

[Ian Fisk] 11:48:28
So I'm including the 15-person staff that runs the place, plus the cost of hosting the facilities, plus the cost of operating the storage, external networking, etc.

[Fernando Harald Barreiro Megino] 11:48:40
I mean, the list prices in particular, if you go on-demand...

[Fernando Harald Barreiro Megino] 11:48:45
I've been told not to compare, but I compared it myself with a US Tier 2, and if you use on-demand instances they are considerably higher. Our subscription agreement is very similar to a US Tier 2, without saying how much a US Tier 2 costs.

[Ian Fisk] 11:49:01
Right.

[Fernando Harald Barreiro Megino] 11:49:17
Because it creates conflicts and fights.

[Enrico Fermi Institute] 11:49:26
No, but about quality of service: if you're going to compare it to a Tier 2, right, is the quality of service that you have to provide for the storage

[Enrico Fermi Institute] 11:49:38
the same on Google Cloud as it is for the Midwest Tier 2, for example?

[Fernando Harald Barreiro Megino] 11:49:45
I mean

[Enrico Fermi Institute] 11:49:46
Because that has an operational effect.

[Fernando Harald Barreiro Megino] 11:49:51
I mean, my opinion is that the quality of service in Google is not in question; they simply have thousands of people.

[Enrico Fermi Institute] 11:50:02
No, I mean the ATLAS services, the ATLAS services that are running there.

[Fernando Harald Barreiro Megino] 11:50:02
So the quality of yeah.

[Enrico Fermi Institute] 11:50:08
that run at Google, right? Can it, for instance, be a nucleus, so that it can serve out data and stuff like that?

[Enrico Fermi Institute] 11:50:13
That's what I mean by quality of service, not their underlying layer.

[Enrico Fermi Institute] 11:50:16
That's all good. What I really mean is the WLCG layer of services and code that has to run on it to have it

[Fernando Harald Barreiro Megino] 11:50:16
So

[Enrico Fermi Institute] 11:50:27
behave like a typical Tier 2 grid site.

[Fernando Harald Barreiro Megino] 11:50:30
I've been working on the PanDA site, and that works as well as any ATLAS Tier 1 or Tier 2, I mean.

[Fernando Harald Barreiro Megino] 11:50:40
I don't know if it's better, because I don't look that much at

[Fernando Harald Barreiro Megino] 11:50:43
other sites. But it's completely flat, so there are never wasted cores.

[Fernando Harald Barreiro Megino] 11:50:49
The failure rate is very, very low, and when there is a failure,

[Fernando Harald Barreiro Megino] 11:50:57
it's usually caused by misconfiguration, just that.

[Fernando Harald Barreiro Megino] 11:51:01
It's new; we've been running it for a month.

[Fernando Harald Barreiro Megino] 11:51:07
And, for example, I underestimated the disk, things like that. But give it half a year, and in my opinion the PanDA site will run as well or better.

[Ian Fisk] 11:51:21
I guess I would just go back to my point, which I think is important for the report: we're always in a situation where we're making a choice in terms of how we allocate the resources, and we're always having to cut back on something else to afford it.

[Ian Fisk] 11:51:33
So in some sense, at some point we're going to have to make an argument that says using the cloud is less expensive by some metric.

[Enrico Fermi Institute] 11:51:53
Kaushik has had his hand raised for a while to jump in.

[Kaushik De] 11:51:57
Yeah. So I wanted to address two of the points that we have had extensive discussion on.

[Kaushik De] 11:52:07
One is the elasticity: they don't seem to care.

[Kaushik De] 11:52:11
They're perfectly fine if you want to use 100,000 cores for one month instead of 10,000 cores over the duration of the project. We are planning to test that when we move to the later parts of our planned program of work and R&D studies with Google. But we certainly

[Kaushik De] 11:52:30
plan to test both models. The only reason why we started with the flat model is that that's what our current computing systems are designed for, and we wanted to give that a quick test.

[Enrico Fermi Institute] 11:52:39
Okay.

[Kaushik De] 11:52:48
We don't have to continue this way. We could run nothing for 3 months, and then run 5 times higher for a month.

[Kaushik De] 11:52:56
It's completely elastic, up to the limits of the resources of the data center.

[Kaushik De] 11:53:03
And then, of course, one can scale up by going to multiple data centers.

[Kaushik De] 11:53:06
So that's the elasticity issue, even with the subscription model, because we have discussed it with them.

[Kaushik De] 11:53:14
The cost-comparison issue, I think, is an important one, but we have to be a little bit careful, because we will never come to a conclusion if we ask Google, or the team that's using Google, to come up with the cost of

[Kaushik De] 11:53:31
a Tier 1 or a Tier 2 site. I mean, that just will never work.

[Kaushik De] 11:53:35
You know it will never work, because every time somebody from the outside tries to evaluate the cost of

[Enrico Fermi Institute] 11:53:38
Me.

[Kaushik De] 11:53:42
a Tier 1 or Tier 2 site, there will be something that people will find to say that it was not done correctly.

[Kaushik De] 11:53:53
So I think it's the Tier 1 and Tier 2 sites who actually, truly have to do the costing, and they actually have to do the comparison.

[Kaushik De] 11:54:03
And they actually have to decide what is best for them: to have on-prem resources or off-prem resources,

[Kaushik De] 11:54:09
and in what particular combination they want to do it.

[Kaushik De] 11:54:12
I think it's up to the Tier 1 and Tier 2 sites.

[Kaushik De] 11:54:16
It's not up to the people who are using Google and Amazon, and it's certainly not up to the salespeople from Google and Amazon, to tell us whether they can do it cheaper than

[Kaushik De] 11:54:25
we can do it. And I think that's what we are focused on doing.

[Kaushik De] 11:54:28
And I think that's really what Fernando's two updated plots show:

[Kaushik De] 11:54:32
here is the cost of doing this and that on Google and Amazon.

[Kaushik De] 11:54:39
And I think that's how we make progress: we run, as transparently as possible, as many different kinds of tests as possible.

[Kaushik De] 11:54:48
We explore all the possibilities that we can.

[Kaushik De] 11:54:54
And then, as experimentalists, we do that through this project over the next 15 months, and then we provide that information. Then it is up to Tier 1 and Tier

[Kaushik De] 11:55:05
2 sites, and people of various kinds, to come and argue this way and that way. And I don't think we, as a technical group, should be part of that.

[Enrico Fermi Institute] 11:55:15
No, but you have to be cautious there, about lost capabilities.

[Enrico Fermi Institute] 11:55:23
And I'll use an example: when a lot of the engineering was pulled out of the physics departments and went to the national labs, university groups lost capabilities.

[Enrico Fermi Institute] 11:55:34
They couldn't do certain things on detector projects. This will be the same.

[Enrico Fermi Institute] 11:55:39
We have to quantify that effect. If you were to, for instance, move all the compute to the cloud, what would we lose?

[Kaushik De] 11:55:46
I completely agree with you, but those are not part of a

[Kaushik De] 11:55:50
technical study of what we can do on Google and Amazon.

[Kaushik De] 11:55:54
Those are really discussions within the field of how we move our field forward.

[Kaushik De] 11:55:59
I think we should separate the two; I don't think we should mix them up.

[Kaushik De] 11:56:02
I think we should look at the quality of service. I think we should look at the type of service.

[Kaushik De] 11:56:07
I think we should look at the services that are actually run there and provided, and we should look at the cost.

[Kaushik De] 11:56:17
That's the scope of what we're doing. Beyond that,

[Kaushik De] 11:56:21
of course, it's up to the field to decide.

[Enrico Fermi Institute] 11:56:24
But even in the technical cost study, because of the labor we provide to it,

[Enrico Fermi Institute] 11:56:29
don't we also have to capture the labor needed to have the same quality of service, as seen from the experiment, as, say, a typical Tier 2?

[Eric Lancon] 11:56:51
Yes, so I wanted to come back on a few statements which were made. I think we need to be very careful about general statements like "it's cheaper than a Tier 2". Those statements

[Eric Lancon] 11:57:15
do not represent ATLAS, and that should be indicated on the slides if there are such statements there. There is a working group within ATLAS which is being set up to properly look at the TCO for operating on the cloud versus a Tier 2, so we may want

[Eric Lancon] 11:57:33
to wait for the conclusions of this working group. What I would like to say is that we are

[Eric Lancon] 11:57:42
very well aware of the cost of cloud compared to on-site operation, because for any big investment we perform a comparison of the cost

[Eric Lancon] 11:57:55
on the cloud, including the Google discount which is being used by ATLAS.

[Eric Lancon] 11:58:05
And we have found, as was noted by Ian Fisk,

[Eric Lancon] 11:58:12
the cost really prohibitive. I cannot give you exact numbers, because we cannot disclose the actual costs.

[Eric Lancon] 11:58:23
But, you know, it is really much lower than any solution which is available on the cloud.

[Fernando Harald Barreiro Megino] 11:58:43
Regarding your first comment: I didn't hear anyone saying that this is cheaper than a US Tier 2.

[Fernando Harald Barreiro Megino] 11:58:52
I don't know where you got that. I said it gets similar with the subscription.

[Enrico Fermi Institute] 11:59:01
Okay.

[Fernando Harald Barreiro Megino] 11:59:05
Well, okay. In any case, I explicitly didn't put a US Tier 2 cost. And for the TCO,

[Fernando Harald Barreiro Megino] 11:59:13
it's what I said at the very beginning.

[Paolo Calafiura (he)] 11:59:27
Yeah, I want to... oh, sorry.

[Paolo Calafiura (he)] 11:59:35
I didn't see the raised hands. Can I go?

[Paolo Calafiura (he)] 11:59:39
I apologize. So, I want to make a comment, which is that one thing we have to keep in mind is the derivative.

[Paolo Calafiura (he)] 11:59:50
The comparison of cloud cost with owned resources wasn't even close the first time we did it, which was about 2016.

[Paolo Calafiura (he)] 12:00:04
I mean, it was like an order of magnitude more expensive.

[Paolo Calafiura (he)] 12:00:07
And while I agree with Eric that the cost comparison is not yet done,

[Paolo Calafiura (he)] 12:00:14
now it's actually worthwhile to do the cost comparison.

[Paolo Calafiura (he)] 12:00:20
It will probably come out still more expensive on the cloud side than on the owned side, but not by a factor of 10.

[Paolo Calafiura (he)] 12:00:27
So I think one of the important roles of these investigations is to be ready, in case for some reason Google Cloud or AWS can buy CPU and storage at prices we don't

[Paolo Calafiura (he)] 12:00:44
have access to. So let's not just see it as a short-term effort,

[Paolo Calafiura (he)] 12:00:52
but as an effort which is thinking about what's going to happen in 5 years.

[Eric Lancon] 12:00:55
No, I agree, Paolo. We should keep a close eye on the cost, and if, for an equivalent level of service, the services are cheaper on the cloud, we should consider going to a cloud solution for some of the applications.

[Steven Timm] 12:01:25
So I have a couple of comments. One is that two to

[Steven Timm] 12:01:29
three years ago, just before Covid, there was a very big study done at Fermilab:

[Steven Timm] 12:01:35
what would it cost to run the Rubin data center here, as opposed to running it on the cloud?

[Steven Timm] 12:01:42
And we tried to cost that all out. I do not know all the numbers exactly, but there was a very comprehensive study done, and that's a data point.

[Steven Timm] 12:01:51
Some of you may be familiar with it already; the people involved could probably get it for you.

[Steven Timm] 12:02:00
If anybody wants to add, they can come in here.

[Ian Fisk] 12:02:01
I think those numbers are public, if people want to see them.

[Steven Timm] 12:02:05
Okay? Yeah, huh? Right

[Steven Timm] 12:02:13
The other thing that we've noticed, from 6 years ago when we first did the big CMS

[Steven Timm] 12:02:21
demo on Amazon until now, is that spot pricing has gone up by about a factor of 2 on Amazon.

[Steven Timm] 12:02:26
In 2016 you could get spot capacity at 25% of the on-demand price;

[Steven Timm] 12:02:33
you can't do that anymore and get any cycles. That's of interest, I think.

[Steven Timm] 12:02:38
And then the third thing is that, as far as costing what it costs to run here as opposed to on the cloud, we've been through many estimates of that; some disagree by a factor of 4, and there are always going to be differences. But sooner or later

[Steven Timm] 12:02:59
we're going to have to say: we need more money to put in more computers,

[Steven Timm] 12:03:03
we need more money for another building, and we're not going to get it. So there will be a limit to how much we can put on site.

[Steven Timm] 12:03:10
And that may eventually be the driver for why we need to go to the cloud.

[Enrico Fermi Institute] 12:03:22
Okay, thanks, Keith. We'll go to Tony.

[Fernando Harald Barreiro Megino] 12:03:34
This is using the cost calculator; I mean, all of the bars except the last one.

[Fernando Harald Barreiro Megino] 12:03:41
It's 10,000 virtual CPUs where you run

[Fernando Harald Barreiro Megino] 12:03:44
whatever you want, 7 petabytes of standard object store, and 1.5 petabytes of egress per month.

[Fernando Harald Barreiro Megino] 12:03:52
It supports whatever you're using it for; it's not strictly related to simulation.

[Enrico Fermi Institute] 12:04:03
Let me ask a question. Do you have the hooks in place in PanDA to

[Enrico Fermi Institute] 12:04:09
capture what CPU model the job reports? Because

[Enrico Fermi Institute] 12:04:16
then we can turn around and figure out, for the number of virtual CPUs you've used for some period of time,

[Enrico Fermi Institute] 12:04:22
what the HEPSpec06 equivalent is, and then one can compare it. At least we know what the Tier 2s provide in terms of HS06.

[Fernando Harald Barreiro Megino] 12:04:38
So I have not looked into that, but for most of the grid sites it gets reported back.

[Fernando Harald Barreiro Megino] 12:04:50
Yeah, the pilot looks for that information and reports it back.

[Enrico Fermi Institute] 12:04:52
Okay.

[Enrico Fermi Institute] 12:05:01
Then it might be very interesting to compare that to benchmark jobs, taken with enough spread so that you get the distribution of what they're actually giving you. Because at the end of the day, right, we get paid in US dollars per HS06.
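The accounting idea raised here can be sketched as a conversion from consumed vCPU-hours to HS06-hours via a per-core benchmark score, so cloud usage can be compared with what the Tier 2s deliver. The score below is an assumed placeholder, not a measured value for any instance type.

```python
# Sketch: convert consumed vCPU-hours into HS06-hours using a per-core
# benchmark score. The score here is an assumed placeholder; in practice it
# would come from running benchmark jobs on the actual VM types.
ASSUMED_HS06_PER_VCPU = 10.0  # hypothetical per-core HS06 score


def hs06_hours(vcpus: int, hours: float,
               hs06_per_vcpu: float = ASSUMED_HS06_PER_VCPU) -> float:
    """HS06-hours delivered by `vcpus` cores running for `hours` each."""
    return vcpus * hours * hs06_per_vcpu


# e.g. 10,000 vCPUs for a 730-hour month:
print(hs06_hours(10_000, 730))
```

As noted in the discussion, the per-core score should be sampled with enough spread across instances to capture the distribution of CPU models actually provided.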

[Enrico Fermi Institute] 12:05:31
Comment from Ian

[Ian Fisk] 12:05:34
I thought Steven was done, but yeah. I just wanted to go back to this labor point,

[Ian Fisk] 12:05:44
this point about cost. I think one thing that we need to assess as a field is what the economics are that change: every time we do this evaluation, we find that things are a little bit closer to being competitive, and at some point maybe they will make the transition over.

[Ian Fisk] 12:05:59
But something has to happen, which is basically that the economy of scale associated with AWS and Google has to be so large

[Ian Fisk] 12:06:09
that they can do it cheaper than we can and still make money. And whether that's colocation of facilities, or the fact that we use all our resources only a fraction of the time, or whatever, something has to change. Because at the end of the day, for the

[Ian Fisk] 12:06:24
same reason I don't drive a rental car to work.

[Ian Fisk] 12:06:28
If you have a facility which you're using all of the time and you operate it yourself,

[Ian Fisk] 12:06:33
it's very hard for someone to undercut you, unless

[Ian Fisk] 12:06:38
they're so large, or so cheap, or they build in cheaper places. We must be able to identify the thing that is going to make it competitive.

[Gonzalo Merino] 12:06:58
Yeah, just a brief comment. I wanted to subscribe to a previous comment from Kaushik. And I must say I'm a little bit surprised about all this discussion of whether it is cheaper or more expensive. I mean, there are like 170 sites,

[Gonzalo Merino] 12:07:17
so the answer will be totally different for each of those sites.

[Gonzalo Merino] 12:07:20
So I think, agreeing with Kaushik, which I totally subscribe to, the value in this exercise, or at least part of it, is: okay, we need to get these numbers, like the ones Fernando showed. That's super useful.

[Gonzalo Merino] 12:07:31
So, what's the cost of running this in a commercial cloud?

[Gonzalo Merino] 12:07:35
And then it is for each of those 170 sites to take this number and compare it to their internal costing, which will be completely different

[Gonzalo Merino] 12:07:43
depending on size, depending on country. The labor cost differs by factors

[Gonzalo Merino] 12:07:49
between different countries. So I don't think discussing whether it's more expensive or cheaper than this or that site is useful. Whether it's Fermilab, or a Tier 2 here in Czechoslovakia,

[Gonzalo Merino] 12:08:01
or in Spain: it's for each of the sites in every country to compare this number to their own cost, which everybody knows,

[Gonzalo Merino] 12:08:10
and then react accordingly. I would say that's the value I see; and the same with the example of the rental car.

[Enrico Fermi Institute] 12:08:37
Shigeki?

[Shigeki] 12:08:41
Yeah. One comment I have is this: there's really no incentive for any of these cloud providers to become the lowest-cost provider. They're in the business to make money, right?

[Shigeki] 12:08:54
And they have hordes of accountants and supercomputers that are constantly hedging the cost of everything, right?

[Shigeki] 12:09:03
Their business model is value-add, not dropping to become the lowest-cost provider.

[Shigeki] 12:09:11
Right.

[Dirk] 12:09:21
The power of competition; I mean, they are competing against each other.

[Dirk] 12:09:29
So

[Enrico Fermi Institute] 12:09:34
Yeah, yes.

[Dirk] 12:09:36
I mean, isn't the saying that, in any mature market,

[Dirk] 12:09:39
the price of a service will basically go down so that the profit approaches zero?

[Enrico Fermi Institute] 12:09:48
Have you flown recently? That's a mature market, and the prices are going the other way.

[Dirk] 12:09:56
Well, demand versus supply. That's the thing about these data centers.

[Fernando Harald Barreiro Megino] 12:10:09
Okay, I think we can then go to the next slide, with Kenyi.

[Kenyi Paolo Hurtado Anampa] 12:10:19
okay. Yup

[Kenyi Paolo Hurtado Anampa] 12:10:28
We'll take that as a yes. So, okay, moving on to the CMS experience.

[Kenyi Paolo Hurtado Anampa] 12:10:33
So I tried to summarize and put a few numbers there from these sources.

[Kenyi Paolo Hurtado Anampa] 12:10:40
This is from one paper and some slides of the work

[Kenyi Paolo Hurtado Anampa] 12:10:45
that was done 5-6 years ago on Amazon and Google Cloud.

[Kenyi Paolo Hurtado Anampa] 12:10:49
So again, these numbers are not up to date; they are from 2016-2017.

[Kenyi Paolo Hurtado Anampa] 12:10:57
So things have changed. But the high-level summary of the conclusion there is that the costs per core-hour for both AWS and Google Cloud were close to similar. The work on Amazon was done over the course of a few days, about 8 days, and you can see in

[Kenyi Paolo Hurtado Anampa] 12:11:20
the top-right plot, in green, the production on AWS,

[Kenyi Paolo Hurtado Anampa] 12:11:29
on top of what was kept on the Fermilab side; and then the bottom plot is what you have from Google Cloud.

[Kenyi Paolo Hurtado Anampa] 12:11:42
And the work on Google cloud was done over the course of about 4 days.

[Kenyi Paolo Hurtado Anampa] 12:11:47
The goal was to double the size, in terms of total available cores,

[Kenyi Paolo Hurtado Anampa] 12:11:57
with respect to what we had in the global pool. The demo was done using

[Kenyi Paolo Hurtado Anampa] 12:12:04
production simulation workflows. And the on-premises

[Kenyi Paolo Hurtado Anampa] 12:12:11
estimate that I put on the slides is the estimate from the paper. Again,

[Kenyi Paolo Hurtado Anampa] 12:12:18
you have the sources linked from arXiv there in the slides.

[Kenyi Paolo Hurtado Anampa] 12:12:25
And then the other factor, just focusing on the operational effort:

[Kenyi Paolo Hurtado Anampa] 12:12:32
for this I got input from the HEPCloud team, and the conclusion is that there was initial effort, mostly related to monitoring.

[Kenyi Paolo Hurtado Anampa] 12:12:47
And this was to prevent waste of compute resources: to track stuck jobs or jobs going too slow, identify

[Kenyi Paolo Hurtado Anampa] 12:12:59
I/O-bound jobs, and identify huge log files that contributed to high transfer costs.

[Kenyi Paolo Hurtado Anampa] 12:13:06
But after that, the ongoing maintenance is low in terms of effort, with an estimate of just one

[Kenyi Paolo Hurtado Anampa] 12:13:17
FTE for occasional things. For example, this is still maintained up to today,

[Kenyi Paolo Hurtado Anampa] 12:13:27
in case CMS wants to use it again. And so, a few months ago, they worked on integrating support for ID tokens.

[Kenyi Paolo Hurtado Anampa] 12:13:49
Alright. We have the last slide, which is basically strategy considerations and discussion, and these are just some bullets.

[Kenyi Paolo Hurtado Anampa] 12:14:01
We talked a lot about cloud costs already, and there are some other bullets there related to egress costs: what is the role

[Kenyi Paolo Hurtado Anampa] 12:14:10
of the cloud in the WLCG discussions, and how to make better use of the cloud.

[Kenyi Paolo Hurtado Anampa] 12:14:26
Now, we are actually at the end of the schedule.

[Kenyi Paolo Hurtado Anampa] 12:14:32
So is it lunch break, or... I don't know.

[Fernando Harald Barreiro Megino] 12:14:38
We still have a little bit of time. So, I mean, on the cost:

[Fernando Harald Barreiro Megino] 12:14:46
people discussed it a lot, but this is the opportunity to discuss any other major worries, for example egress costs or other worries about the cloud. Or are there any particular ideas for how we can make better use of the cloud, like exploiting elasticity,

[Fernando Harald Barreiro Megino] 12:15:13
using GPUs, or whatever, or what was discussed about Lancium.

[Dirk] 12:15:32
I just wanted to... we already talked about elasticity, and I just wanted to maybe focus on one of the points on the slide.

[Dirk] 12:15:42
It says the different planning horizon versus our own equipment, and that gives you kind of a different layer of elasticity. Because when you purchase equipment, it's not only that you have a certain number of deployed cores in your data center; it's also that when you purchase

[Dirk] 12:15:58
the equipment you usually basically make a commitment for the next 3, 4, or 5 years.

[Dirk] 12:16:04
Whatever the retirement window now is for hardware that you buy.

[Dirk] 12:16:08
It's gone up a bit. With cloud, you don't have to make that commitment.

[Dirk] 12:16:15
Now, the thing is, though, in our science we usually have pretty stable workloads, so we can't really take full advantage of that.

[Dirk] 12:16:23
So usually we buy equipment for 4 years, and we expect... I mean, year to

[Dirk] 12:16:30
year we always have the work to keep it busy. But looking out,

[Dirk] 12:16:35
there's the dip before HL-LHC

[Dirk] 12:16:42
comes up. I don't know if that's something where cloud maybe could help.

[Dirk] 12:16:48
If at that point we were, like, 20% cloud, you could say: okay, for the off years, the shutdown years, you just don't buy any cloud cycles.

[Dirk] 12:16:59
I'm not sure how that would play with subscription renewals, like, if you are in a subscription model, whether you could just skip a renewal and then resume a year

[Dirk] 12:17:08
later. But that's a possibility, and you really don't have that with purchased equipment, because you kind of continuously keep buying equipment,

[Dirk] 12:17:21
just so you don't have everything retired all at once.

[Dirk] 12:17:24
I mean, you kind of cycle over your whole data center.

[alexei klimentov] 12:17:41
I think this is a very simplistic approach.

[Enrico Fermi Institute] 12:17:41
Go ahead, Alexei.

[alexei klimentov] 12:17:48
I think, at least what we are trying to do in ATLAS:

[alexei klimentov] 12:17:53
we are trying to integrate clouds in our computing model. I want to remind you, one of the first uses of clouds, at least which I remember, was done by the Belle experiment,

[alexei klimentov] 12:18:14
not by Belle II, but by Belle, when they needed to conduct a Monte Carlo campaign, and the way they designed it,

[alexei klimentov] 12:18:23
for them it was cheaper just to buy cycles to run this Monte Carlo campaign.

[alexei klimentov] 12:18:32
So I think about just this comparison, and also what was mentioned before by several people:

[alexei klimentov] 12:18:40
is it a replacement for what we have with our own resources?

[alexei klimentov] 12:18:44
Of course not; it is not a replacement. But it is a resource which we can use, and elasticity, for me,

[alexei klimentov] 12:18:52
is one of the main features which we can use. And as Paolo mentioned also, before we go to purchase something new which we don't have now, we can try it in the cloud.

[alexei klimentov] 12:19:07
I also kind of disagree with the statement that our workflows are

[alexei klimentov] 12:19:12
very standard, or whatever word you use, because what we see even now, and I think it will

[alexei klimentov] 12:19:19
continue in this direction, is that we have new, more complex workflows, which we, at least in ATLAS,

[alexei klimentov] 12:19:29
did not have during Run 2, and for High Luminosity it will be more and more like that.

[alexei klimentov] 12:19:33
So that's why I think the problem is more complex, and we need to address it in a more complex way, and not try to, what I'm afraid of,

[alexei klimentov] 12:19:45
you know, split it into small pieces, and then... well, we all know.

[Eric Lancon] 12:20:02
Yes, sorry, I was muted. I do agree with Alexei that there

[Eric Lancon] 12:20:09
are more complex workflows coming, and there is a need to adapt.

[Eric Lancon] 12:20:13
What I don't fully follow is the conclusion that the cloud is best suited for this;

[Eric Lancon] 12:20:22
the facilities need to work to adapt to the new requirements.

[Eric Lancon] 12:20:28
And that's what will make the comparison at the end.

[alexei klimentov] 12:20:46
To refine my comment: I fully agree with you, and that's why

[alexei klimentov] 12:20:52
what we will try in PanDA, that is, a full chain, for me,

[alexei klimentov] 12:21:00
is the bigger one. And those first 8 days were also a try at that.

[Enrico Fermi Institute] 12:21:29
So one comment I had: we've spent a lot of time talking about how the clouds hook into the existing workflow systems, PanDA and whatnot.

[Enrico Fermi Institute] 12:21:39
Does it make sense to further talk about, or explore, how clouds can either be used as analysis facilities or extend analysis facilities in some way? One of the things that users might want, for example, are exotic types

[Enrico Fermi Institute] 12:22:01
of things, or accelerators, you know, GPUs, things like that.

[Enrico Fermi Institute] 12:22:05
Can we use clouds to sort of pad out those kinds of resources at analysis facilities? Does it make sense to explore that?

[Fernando Harald Barreiro Megino] 12:22:14
So in all of the clouds there is always the possibility for the user to get an account and really do whatever they need,

[Fernando Harald Barreiro Megino] 12:22:31
if it's more of a central analysis facility...

[Fernando Harald Barreiro Megino] 12:22:39
The analysis facilities that we are usually talking about in ATLAS or CMS:

[Fernando Harald Barreiro Megino] 12:22:46
for that there will also be, in the ATLAS project, R&D

[Fernando Harald Barreiro Megino] 12:22:51
to extend that. And some ideas to do that were presented in the last week or two.

[Enrico Fermi Institute] 12:23:14
Is my mic working? So, something that was interesting, and I don't have it right at my fingertips:

[Enrico Fermi Institute] 12:23:20
Purdue University actually got a pretty big grant from Google to set up a system where basically their batch system can burst into the Google cloud. But they also have all the VPNs and whatnot set up, and the images are the same images as their

[Enrico Fermi Institute] 12:23:42
compute farm's. And with the VPN setting up the networking and whatnot, the remote hardware, the cloud hardware, is, quote, the same as just the regular batch they have there. So, outside of latency or whatever, you can basically

[Enrico Fermi Institute] 12:23:57
slam in Condor, or, I think they run Slurm there, you could slam in Slurm jobs and run whatever you want. So there's definitely work that's been done.

[Enrico Fermi Institute] 12:24:32
So maybe to bring up another topic from yesterday: we mentioned here a little bit about using cloud to run some kind of particular campaign or what have you. Does that have any effect on how we think about pledging clouds?

[Enrico Fermi Institute] 12:24:53
And in general, are there any discussions we want to have about pledging clouds?

[Enrico Fermi Institute] 12:25:04
Dirk, you want to jump in?

[Dirk] 12:25:06
Yeah, I think the cloud fits into the discussion we had yesterday about pledging.

[Dirk] 12:25:15
I think, under the current rules, to pledge a cloud you would have to pledge a certain minimum amount

[Dirk] 12:25:22
of cores. So if you replicate a site where you basically always keep, like, 4,000 cores running,

[Enrico Fermi Institute] 12:25:23
Yeah.

[Dirk] 12:25:29
you could pledge the 4,000 cores, but you couldn't really take advantage of elasticity.

[Dirk] 12:25:35
So you kind of would have to pledge the lower boundary, within some limits, because even grid sites are allowed to go below the floor for a limited amount of time, I think. But it puts limits on basically how flexibly you can use the

[Dirk] 12:25:53
resources. It's the same problem we have with the scheduling on the HPCs:

[Dirk] 12:25:57
you basically can't just keep it off for 11 months of the year and then use up everything in a month. That wouldn't work with how the pledges are structured right

[Dirk] 12:26:08
now, and what the rules are.

[Enrico Fermi Institute] 12:26:09
We pledge HS06, not cores.

[Enrico Fermi Institute] 12:26:21
Right, but the point is that we have to figure out,

[Enrico Fermi Institute] 12:26:27
if you're going to even consider pledging cloud resources, how to put it in a unit that is consistent with what we have, so it's apples to apples.

[Steven Timm] 12:26:59
Yes, I was going back to the question of exotic resources.

[Steven Timm] 12:27:04
I know the comment was made yesterday that the exotic resources, such as the P instances on Amazon, the FPGAs and the tensor things or whatever, are always the highest-priced things you can get. But you still have to weigh that against having them sit on

[Steven Timm] 12:27:21
site, on premises, sitting there and sucking up power all the time

[Steven Timm] 12:27:25
and not being used all the time. At least we don't yet have a day-to-day

[Steven Timm] 12:27:31
use case for GPUs or TensorFlow or FPGAs, or whatever.

[Steven Timm] 12:27:37
So there is value there, and I've heard from management that they'd prefer...

[Bockelman, Brian] 12:28:08
Yeah, I just wanted to maybe tackle something

[Bockelman, Brian] 12:28:13
that Doug said, a little differently. I'm worried less about the HEPSpec

[Bockelman, Brian] 12:28:20
06 equivalent, and more about the fact that for cloud resources you probably need to pledge in HEPSpec06-hours. Right?

[Bockelman, Brian] 12:28:30
Right? It's the difference between kilowatts versus kilowatt-hours, you know. Some aspect of the pledge,

[Bockelman, Brian] 12:28:39
again going to the power grid analogy, needs to be in kilowatt-hours.

[Bockelman, Brian] 12:28:45
And what the benchmark is, I think, is less important.

[Bockelman, Brian] 12:28:49
But how do you come up with a proposal that balances the fact that you do need some base capacity, and that's important,

[Bockelman, Brian] 12:28:59
with the fact that it's very unlikely 100% of our hours need to be base capacity?

[Bockelman, Brian] 12:29:06
So: some combination of kilowatts and kilowatt-hours, or the analogues thereof, in our pledges.
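The kilowatt versus kilowatt-hour distinction can be made concrete with a toy two-part pledge check. The floor rule, the 85% tolerance, and all numbers below are hypothetical illustrations, not actual WLCG rules:

```python
# Toy two-part pledge: a capacity floor (like kilowatts) plus an
# integrated-delivery target (like kilowatt-hours). Hypothetical rules.

def meets_pledge(monthly_hs06, floor_hs06, annual_hs06_hours, floor_fraction=0.85):
    """monthly_hs06: average HS06 capacity delivered in each of 12 months."""
    hours_per_month = 730  # ~8760 h/year divided by 12
    delivered_hours = sum(c * hours_per_month for c in monthly_hs06)
    floor_ok = all(c >= floor_fraction * floor_hs06 for c in monthly_hs06)
    integral_ok = delivered_hours >= annual_hs06_hours
    return floor_ok, integral_ok

# A bursty cloud site: huge capacity in one month, zero the rest.
bursty = [120_000] + [0] * 11
floor_ok, integral_ok = meets_pledge(bursty, floor_hs06=10_000,
                                     annual_hs06_hours=10_000 * 8760)
# It meets the integrated target (120,000 * 730 >= 10,000 * 8760) but fails
# the floor, which is the point above: today's rules only express the floor.
```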

[Johannes Elmsheuser] 12:29:20
Right, a follow-up comment to this: in the end the pledges are always, as you say, a unit per year, right?

[Johannes Elmsheuser] 12:29:31
And we don't have a unique CPU architecture either, right?

[Johannes Elmsheuser] 12:29:37
So over the years, with all the procurements, there are always different kinds of CPU architectures.

[Johannes Elmsheuser] 12:29:47
And, as was said before, right, we have more or less the same problem on the grid.

[Johannes Elmsheuser] 12:29:57
We are also averaging there, so we don't have the same unit over and over at the same site.

[Johannes Elmsheuser] 12:30:03
Right? So in principle, here on the cloud, we are solving the same problem.

[Johannes Elmsheuser] 12:30:08
So I don't really see this as problematic in that sense, because we have done exactly the same thing for the last 10-15 years on the grid.

[Bockelman, Brian] 12:30:18
Yep. I don't think I'm following, because what we pledge on the grid is a certain HEP

[Bockelman, Brian] 12:30:27
Spec06 capacity that is available starting at a given time period.

[Bockelman, Brian] 12:30:33
Right, let me say... oh, but it's...

[Johannes Elmsheuser] 12:30:34
Right? And that's for one year, right? It

[Johannes Elmsheuser] 12:30:40
It's good for one year, and at the site you don't have a specific unit of one CPU, right?

[Johannes Elmsheuser] 12:30:47
You always have an average, and that was the argument before.

[Bockelman, Brian] 12:30:50
Oh!

[Bockelman, Brian] 12:30:55
Hmm! No, no, no, but that's very different. It's not the average, right? Because

[Bockelman, Brian] 12:31:01
I can't come in and give you 12 times as much capacity

[Bockelman, Brian] 12:31:03
in January and zero it out for the next 11 months.

[Bockelman, Brian] 12:31:07
That is most definitely not what the MoUs say.

[Bockelman, Brian] 12:31:12
It's a very specific HEPSpec06 count, available, you know, depending on whether you're a Tier-1 or Tier-2,

[Bockelman, Brian] 12:31:19
I forget what the number is, 85 or 95% of the time.

[Ian Fisk] 12:31:27
right.

[Johannes Elmsheuser] 12:31:28
Sure, but I agree that you give an average, basically, over a certain time period.

[Johannes Elmsheuser] 12:31:34
I think we agree here, right? And as you say, we then have to say: okay, you provided this

[Johannes Elmsheuser] 12:31:41
for 4 months, or for 3 months, or something like this. And this is then the pledge.

[Ian Fisk] 12:31:49
No, I guess I'd also like to argue that our pledging model, as it is right now, is probably not ideal. We have a model which is based on the fact that we have dedicated facilities that were purchased, and the experiment's responsibility is to

[Ian Fisk] 12:32:04
demonstrate that over the course of 12 months they can use them at some average rate, so that we both provision and schedule for average utilization. And whether it's HPC

[Ian Fisk] 12:32:14
or whether it's clouds, there's an opportunity to not do that. And we might find as collaborations that the ability to schedule 5 times more for some period of a month, and let it lie fallow for the rest of the year, was actually a much more efficient use of people's

[Ian Fisk] 12:32:32
time, and that our current pledging model is sort of limiting.

[Ian Fisk] 12:32:36
I believe Maria Girone, who's connected, presented this at CHEP Osaka,

[Ian Fisk] 12:32:41
probably 6 years ago: the concept of scheduling for peak. And it seems like, because we have dedicated resources, we have to show that they're well used.

[Dirk] 12:33:25
Yeah, and maybe one complication with scheduling for peak:

[Dirk] 12:33:30
you actually have to think about and justify what you want to use for the peak.

[Dirk] 12:33:36
So it's more complicated to plan this; in steady state you just keep it busy.

[Ian Fisk] 12:33:39
It is more complicated, right. It's more complicated to plan.

[Ian Fisk] 12:33:44
It requires people to be better prepared. It requires people to...

[Dirk] 12:33:47
Yeah. But that's maybe why it hasn't happened yet.

[Ian Fisk] 12:33:49
Right, but at the same time it would allow... like, imagine that a 6-month Monte Carlo campaign was a one-month Monte Carlo campaign, and then you'd

[Ian Fisk] 12:33:58
have 5 months where people have the complete set for analysis. That might be much more efficient.

[Ian Fisk] 12:34:04
And that's also, I think, a motivation for why you might want to go to clouds, even if they were on paper more expensive: you'd have to make some metric of how much of people's time you're saving.

[Enrico Fermi Institute] 12:34:17
Whose time are you saying you're saving?

[Ian Fisk] 12:34:22
I would claim, well, the entire collaboration's time to physics, perhaps, I'm saying.

[Enrico Fermi Institute] 12:34:23
Which people's time?

[Enrico Fermi Institute] 12:34:34
How do you accurately measure that without drawing a false conclusion?

[Ian Fisk] 12:34:40
I don't... I think it's difficult to...

[Ian Fisk] 12:34:42
I think it's probably somewhat difficult to measure the inefficiency that we have right now, but I think you can...

[Enrico Fermi Institute] 12:34:48
Okay.

[Ian Fisk] 12:34:49
I think, without drawing a false conclusion, I can claim that the particular way it's set up right now is designed to optimize a specific thing, which is the utilization of particular resources.

[Ian Fisk] 12:35:14
And I guess I'm claiming that's not the...

[Ian Fisk] 12:35:18
If I assume that's the most important thing because we spent all this money buying dedicated computers,

[Ian Fisk] 12:35:23
yeah, that's a reasonable thing to say: we're not gonna let these things sit idle,

[Ian Fisk] 12:35:27
we're not gonna over-provision. But I think it's very difficult to state that the optimization that was designed to use this particular resource happens to also be exactly the perfect optimization

[Ian Fisk] 12:35:40
for these other kinds of metrics, like time to physics.

[Dirk] 12:35:56
...efficient use of resources. I mean, that's the one thing with cloud. When you buy the...

[Dirk] 12:36:02
That's the one main difference I see: you buy resources,

[Dirk] 12:36:07
You have them sitting on your floor, you might as well use them, because it's already paid for.

[Dirk] 12:36:10
So it's already paid for, and at that point using them doesn't cost much more, okay, energy costs, whatever.

[Dirk] 12:36:14
But you kind of have to keep them busy. HPC

[Dirk] 12:36:16
and cloud, you kind of have to justify, because you're more elastic.

[Dirk] 12:36:19
So you get the allocation, and especially with cloud you want to make use of flexible, elastic scheduling.

[Dirk] 12:36:28
So at that point you have to justify each use, so it's more complicated to do that.

[Dirk] 12:36:34
But hopefully, if you do it right, you get a more efficient use of resources out of it.

[Enrico Fermi Institute] 12:36:43
But how do you measure that?

[Dirk] 12:36:46
I don't know.

[Enrico Fermi Institute] 12:36:50
Because, think of it this way: say you take a 10% cut of what we're doing now, and that 10% diverts to the cloud. Then you have to see if that 10% diversion would give you more bang for the buck.

[Ian Fisk] 12:37:21
Well, we actually did this, at least from a cost standpoint, in a contrived way, for disaster

[Ian Fisk] 12:37:28
recovery: what would it cost you? The scenario is, I've messed up my reconstruction, I need to reprocess things, and I only have a month.

[Ian Fisk] 12:37:39
Is there a model, a reasonable insurance policy, which says: I'm gonna use the cloud for that kind of thing?

[Ian Fisk] 12:37:45
And so in some sense you can make arguments for where this is valuable in very specific situations, like when there's been a problem.
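The insurance-policy scenario reduces to a back-of-the-envelope sizing exercise. A sketch, where the event count, seconds per event, and hourly price are all hypothetical placeholders:

```python
# How many cloud cores, and at roughly what cost, would it take to
# reprocess a dataset in one month? All numbers are illustrative.

def recovery_estimate(n_events, sec_per_event, days, usd_per_core_hour):
    """Cores needed to finish in `days`, and the total compute bill."""
    cpu_hours = n_events * sec_per_event / 3600
    wall_hours = days * 24
    cores_needed = cpu_hours / wall_hours
    cost = cpu_hours * usd_per_core_hour
    return cores_needed, cost

cores, cost = recovery_estimate(n_events=5_000_000_000, sec_per_event=30,
                                days=30, usd_per_core_hour=0.02)
print(f"~{cores:,.0f} cores for a month, roughly ${cost:,.0f}")
```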

[Johannes Elmsheuser] 12:38:25
I have a completely different comment, or a question, on the third point you have here, the bullet point on data

[Johannes Elmsheuser] 12:38:32
safeguarding. Is this something of concern or not?

[Johannes Elmsheuser] 12:38:40
Or do we just say that, well, the team basically has to safeguard our data against users who are repeatedly downloading it,

[Johannes Elmsheuser] 12:38:54
and then we are safe? Or is there something else behind it?

[Fernando Harald Barreiro Megino] 12:38:56
what.

[Johannes Elmsheuser] 12:38:59
Is there something else behind this data safeguarding keyword

[Johannes Elmsheuser] 12:39:02
here?

[Fernando Harald Barreiro Megino] 12:39:03
Well, that's a comment that I sometimes hear: that you don't want to have, like...

[Johannes Elmsheuser] 12:39:27
Okay, right. So that's the computing model: that you always have, so to say, another unique copy of your raw data,

[Johannes Elmsheuser] 12:39:40
for example in the cloud. That would be behind that?

[Fernando Harald Barreiro Megino] 12:39:43
Yeah. So, like, what is the overall role: can a cloud be a nucleus?

[Fernando Harald Barreiro Megino] 12:39:50
Or can the cloud only be treated as temporary storage?

[Fernando Harald Barreiro Megino] 12:40:00
So the point is to let people express any worries regarding this.

[Ian Fisk] 12:40:12
I guess I would like to express a worry regarding that, which is that I don't think any reasonable funding agency is going to let you make a custodial copy of the data in the cloud, because there's no guarantee that they won't change the rates to become

[Ian Fisk] 12:40:28
prohibitively expensive to move things out, or prohibitively expensive to move things in.

[Ian Fisk] 12:40:33
In the same way that the agency won't let you sign a 10-year lease on a fiber without tremendous amounts of negotiation,

[Ian Fisk] 12:40:40
they're not going to allow you to make a commitment in perpetuity for data storage.

[Ian Fisk] 12:40:44
So I think that actually, almost by definition, puts the clouds in a very particular place in terms of storage and processing: things that are transient, and things that can be recorded out at the end of the job. Because otherwise you're in a difficult situation.

[Kaushik De] 12:41:16
Yeah, coming back to the question of how to make the most out of the clouds:

[Kaushik De] 12:41:20
one of the things that we have heard a lot about over the past many years are the AI/ML tools and capabilities and ecosystem on the cloud. Is that something we should continue to pursue? Is that something that should be added to the list, in the sense of: are we missing out on something,

[Enrico Fermi Institute] 12:41:33
Okay.

[Kaushik De] 12:41:47
or is that something we think we know how to do better with our own tools?

[Dirk] 12:41:55
There is a session in the afternoon actually on R&D,

[Dirk] 12:41:58
specifically on machine learning training, and we actually have an invited talk from Son.

[Dirk] 12:42:04
I think it's HPC,

[Dirk] 12:42:07
training on HPC, but it's similar. I mean, it's both HPC...

[Enrico Fermi Institute] 12:42:21
It's also the case that the clouds do have some proprietary exotic cards, right, that aren't available to the general public and that are really meant for machine learning applications.

[Dirk] 12:42:37
Yeah, but then the bigger question is: what role will machine learning play in

[Dirk] 12:42:46
our computing operations going forward? And I don't know that we have the answer,

[Dirk] 12:42:50
neither CMS nor ATLAS has the final answer on that.

[Dirk] 12:42:53
So it's a bit hard to say: this is the way to go.

[Kaushik De] 12:43:02
I mean, the one thing is that, yeah, I think we are...

[Kaushik De] 12:43:11
You know, we have been trailblazers in many, many areas. But when it comes to the production use of AI/ML, the everyday use of AI/ML,

[Kaushik De] 12:43:26
I think cloud and business systems do so much of it.

[Kaushik De] 12:43:34
How do we pull that in and access that?

[Kaushik De] 12:43:40
And I'm not just being paranoid, but to me, for production-level activities... I notice that almost anything Google does nowadays, anything from their own products like Maps to the services they provide, is really heavily

[Kaushik De] 12:44:08
dominated by AI/ML. I mean, it's almost exclusively AI/ML. But are we?

[Dirk] 12:44:21
Let me maybe make a comment, because

[Dirk] 12:44:25
yesterday a CMS use case was shown where they basically ran a MiniAOD production, which is: you take the AOD, which is a larger analysis format, and then slim it down and do some recomputations

[Dirk] 12:44:37
to get to a MiniAOD, which is smaller and actually useful for

[Dirk] 12:44:40
analysis. And they are pushing for a model where they use a machine learning algorithm;

[Dirk] 12:44:47
the algorithm does use machine learning, but during the production phase you run only the inference server. So you're not actually running the learning.

[Dirk] 12:44:55
And that, for me, is the bigger question.

[Dirk] 12:44:58
Because if you do a one-time shot where you run your learning algorithms on a bunch of data that we have,

[Dirk] 12:45:04
figure out what you want to do, and then only run the inference

[Dirk] 12:45:08
during the heavy-lifting reconstruction or whatever else you do, then I'm not sure to what extent this is really impacting the overall computing operations.
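The "train once, infer in production" split being described can be sketched schematically. This is a pure-Python toy, with a least-squares fit standing in for the expensive training step; it is not anything from the actual CMS workflow:

```python
# Toy illustration of "train once, infer many": the expensive fit happens
# out-of-band; the production loop only evaluates the frozen model.

def train(samples):
    """One-time, expensive step: fit y = a*x + b by least squares."""
    n = len(samples)
    sx = sum(x for x, _ in samples)
    sy = sum(y for _, y in samples)
    sxx = sum(x * x for x, _ in samples)
    sxy = sum(x * y for x, y in samples)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return (a, b)          # the "frozen" model shipped to production

def infer(model, x):
    """Cheap per-event step run inside the production job."""
    a, b = model
    return a * x + b

model = train([(0, 1), (1, 3), (2, 5)])       # offline training campaign
outputs = [infer(model, x) for x in range(4)]  # production only infers
```

In the real setup the "frozen model" would be a trained network behind an inference server, but the division of when the compute is spent is the same.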

[Kaushik De] 12:45:32
Yeah. And another aspect of this is that elasticity comes in when you talk about training; I mean, unless you go to continuous training models, which people are trying to do.

[Dirk] 12:45:57
For these large training runs, how much capacity

[Dirk] 12:46:03
are we really talking about? Is that making an impact on our overall compute resource use?

[Kaushik De] 12:46:28
Yeah, and we already have inference as a service for that.

[Dirk] 12:46:28
Okay, So that.

[Ian Fisk] 12:46:43
I think that's probably one of the ideal applications, primarily for HPC,

[Dirk] 12:46:46
Yeah.

[Ian Fisk] 12:46:48
because they already have that kind of hardware, and it doesn't...

[Dirk] 12:47:03
The one thing, though, with this kind of application, and we will make a comment on it in the report,

[Dirk] 12:47:10
is that by design it kind of happens outside the current production systems and infrastructure. So it's kind of standalone, and I'm not sure to what extent it's really in scope

[Dirk] 12:47:22
for the report.

[Ian Fisk] 12:47:22
I think this is one of the places where the concept of scheduling for peak comes into play. Because as you go to more machine learning things that require training and hyperparameter tuning before you start running, you change when the computing is spent: you spend the computing beforehand, and

[Dirk] 12:47:37
Yes.

[Ian Fisk] 12:47:39
then it's much faster for things like inference. And so it is a place where the model that says we're gonna use them all in steady state...

[Dirk] 12:47:56
And also, I mean, that's even where I see a mismatch:

[Dirk] 12:48:01
like, thinking out the pledging of such resources. If you assume that this resource use is significant, you want to be able to pledge it.

[Enrico Fermi Institute] 12:48:14
Okay.

[Dirk] 12:48:15
But it's a single-purpose pledge, which is completely outside the scope of what pledging currently is.

[Dirk] 12:48:22
But you want to get some kind of credit for such a use case, so that's even harder than what we discussed so far, which is basically just adjusting the pledging to be more

[Dirk] 12:48:37
like a time-integrated value, not just the AC

[Ian Fisk] 12:48:41
Right. And the kind of resources we're talking about here are the most expensive things we have.

[Dirk] 12:48:41
versus DC argument.

[Enrico Fermi Institute] 12:48:54
So maybe that needs to be written in the final report, so that the idea to push for flexibility gets across.

[Enrico Fermi Institute] 12:49:12
Because it is a different thing: for the training you really do want to use hardware that's designed for it; it works so much better.

[Enrico Fermi Institute] 12:49:22
Which makes it a special case, because it's specialized beyond what our code stack uses.

[Dirk] 12:49:40
I mean, we're trying that, too.

[Dirk] 12:49:44
This is an active area of R&D, trying different approaches. I mean, in CMS we have the HLT;

[Dirk] 12:49:50
the tracking basically runs on GPU, and that gives a pretty significant speedup.

[Steven Timm] 12:50:13
A good point, just with you guys and Lancium, but also for some of the other more exotic resources, even more probably on the HPCs, on the LCF

[Steven Timm] 12:50:23
systems: there are opportunities for things that can run opportunistically, go and grab a couple of hours of compute, and come back with useful stuff.

[Steven Timm] 12:50:36
There you may want to think about: is there some redesign of the workload that has to happen to best exploit those kinds of resources? Because some workflows are more sensitive:

[Steven Timm] 12:50:52
if you're preempted, you lose everything, basically, if you've been running for 10 hours with 2 to go, or something like that.

[Steven Timm] 12:50:58
We hit, for instance, the case that you could only get a 24-hour job slot if you submitted at least 1,000 jobs.

[Steven Timm] 12:51:08
I don't have any answers for that, but it's something you should keep in mind when you're planning for non-conventional resources,

[Steven Timm] 12:51:20
to make sure you can get more stuff done.

[Dirk] 12:51:23
I think that's one of the differences between the approaches targeting HPC. But that mostly affects HPC, because cloud just allows you to schedule whatever you're paying for.

[Dirk] 12:51:35
So they don't...

[Steven Timm] 12:51:38
Well, Lancium can go down any time, right?

[Dirk] 12:51:40
They can; but in practice, if they went down every 30 minutes, it would probably become unusable for us. So we kind of rely on the fact that, even though in principle it can go down every 30 minutes, it doesn't actually happen all that often, and we cover

[Dirk] 12:52:00
whatever happens by making it an efficiency problem. Basically, the failure-handling code in our software

[Dirk] 12:52:06
stack can deal with it, and it just becomes an efficiency issue that goes into the cost

[Dirk] 12:52:10
calculation. I think if it gets more complicated than that, it becomes really problematic to use the resources. And I know that ATLAS has the Harvester model where, in principle, you can survive,

[Dirk] 12:52:23
like, you can make use of very short time windows.

[Dirk] 12:52:28
But we don't have that in CMS, and I'm not sure how effective that is for ATLAS either.
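Folding preemption into the cost calculation, as described here, amounts to dividing the list price by a useful-work efficiency. A sketch, with all numbers as hypothetical illustrations rather than measured Lancium figures:

```python
# Treat preemption losses as an efficiency factor on the effective price.

def effective_price(list_price_per_core_hour, mean_hours_between_preemptions,
                    lost_hours_per_preemption):
    """Price per *useful* core-hour once preempted work is discarded."""
    useful = mean_hours_between_preemptions - lost_hours_per_preemption
    efficiency = useful / mean_hours_between_preemptions
    return list_price_per_core_hour / efficiency

# e.g. preempted every ~48 h on average, losing ~2 h of work each time:
price = effective_price(0.01, mean_hours_between_preemptions=48,
                        lost_hours_per_preemption=2)
# a small surcharge over list price, as long as preemptions stay rare;
# frequent preemption (say every 30 min with 20 min lost) blows it up
```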

[Fernando Harald Barreiro Megino] 12:52:46
Dirk, Kenyi: what do you think, should we close this session?

[Dirk] 12:52:56
Yeah, I mean, there's less than 10 minutes left.

[Dirk] 12:52:59
There was some talk about maybe pulling one of the talks earlier, but that's not enough time, and it would probably trigger discussion.

[Enrico Fermi Institute] 12:53:00
The

[Dirk] 12:53:07
So we can go with it first in the next session.

[Enrico Fermi Institute] 12:53:11
Yeah, I think the discussions we've been having the last 10 or 15 minutes lead nicely into the R

[Enrico Fermi Institute] 12:53:17
&D presentation.

[Enrico Fermi Institute] 12:53:25
Maybe we break here, unless anybody has any other cloud topics that they want to bring up.

[Enrico Fermi Institute] 12:53:30
I think this is the last session that's focused exclusively on cloud.

[Enrico Fermi Institute] 12:53:37
Yeah, in the next session we'll talk about some R

[Enrico Fermi Institute] 12:53:43
&D things, and networking.

[Enrico Fermi Institute] 12:53:53
Okay, so maybe we break here, and we'll see everybody at one o'clock.