Unknown Speaker 0:06 All right,
Unknown Speaker 0:07 we're good.
Unknown Speaker 0:09 Are we on the recording? All right, excellent. So welcome back, everybody, to day three of our open data workshop. We have some really interesting and, I think, cutting-edge approaches to using the open data, both for computing and analysis in the cloud and for some new tools that might make analysis easier down the road. Before we jump in, I have a few announcements. We'd like to get a group picture of everybody. In today's new age, that means a picture of all of us on Zoom, maybe even a couple of pictures, where we scroll through all of the visible people. We're going to do that at 9:30 Central time, so one hour from now. We'll pause the lesson, get a picture of everybody, use that as a moment to take a few minutes and stretch, and then come back in. If you feel you want to change the background, or change your shirt, or get a quick haircut, you have an hour to do that. All our lessons today are going to be synchronous. That is, we will be regularly engaging with you throughout the day, and the facilitators will be available and working. We'd like to do this because there's a very important part of the workshop coming up in the afternoon, again depending on where you are. If you look at the schedule, at 15:00, three o'clock pm in the Central time zone, we've blocked out two hours for a discussion about your impressions of the open data workshop, as well as the open data portal and the releases themselves. We're going to be sending out a survey within a few days of the workshop ending, and we really, really want to hear your feedback: what you think would make the workshop better, and what you think would make the open data and the access to it better.
But we want to have as much of an informal discussion as we can towards the end of the day. No comments and no questions are off limits. We really want to hear how to make this useful for you. So that'll be at the end of the day, and we hope that everybody sticks with us. It's been a long three days, and you should all be commended for everything that you've done. One more thing before we jump in: I want to take a moment to recognize my co-organizers, Edgar and Kati in particular, who have really done the lion's share of the organization and work on this. If it wasn't for them, this would not have come together. Kappa DLA and all the other LPC organizers have worked behind the scenes to do this, as have all the facilitators, who put in an incredible amount of time over the last few weeks and months. So all credit and all thanks to them for the incredible amount of work that everybody's put in to bring this together. Okay. I'm now going to hand it over to Clemens Lange and Adelina, who have put together this lesson on how to run CMS in the cloud. This is really new, and I'm excited to go through it. Clemens is a researcher at CERN who has been driving a lot of the work behind the scenes on the open data and solving problems for us. I cannot sing his praises enough. I'm going to step back now and turn it over to Clemens and Adelina for their presentation.
Unknown Speaker 4:07 Great. Thanks, Matt, for the nice introduction.
Unknown Speaker 4:11 Okay, I will need permission to share my slides.
Unknown Speaker 4:51 Okay, now, one second. Okay. So this presentation, and this work in particular, has been prepared by Adelina and me, with lots of alpha and beta testing by Kati. We're going to be talking about running CMS analysis in the cloud.
And when it comes to the cloud, one basically starts to talk about Kubernetes right away. This is basically taking all the stuff that you've learned over the past few days on using containers and deploying them, in some way or other, at different levels. Kubernetes is basically a tool to operate containers and manage them at larger scale. That then allows you to run real physics analysis workflows on public compute offerings. It's also used for lots of other things, but for our purpose here we're going to use it for physics analysis. You can read a lot about Kubernetes. It's an open source project, I think by now six or seven years old. Adelina will be talking about what this is about and a couple of, let's say, technicalities that you will have to know, so that you don't get lost with the wording when we get into the meat later on.
Unknown Speaker 6:25 Okay, great. Thank you, Clemens. So hi, everybody. Like Clemens just explained, Kubernetes is an amazing tool when you want to eliminate all the manual processes involved with deploying and scaling and instead have it done automatically for you. It's used by a lot of big companies for scaling different containerized applications, but we will also utilize it for our physics analyses, which you will see in this workshop lesson today.
Unknown Speaker 7:04 If you take the next slide.
Unknown Speaker 7:07 A few
Unknown Speaker 7:10 words that you will hear in today's lesson, just to go through them so they sound familiar and you have a better idea of what they mean. We often talk about the Kubernetes cluster, which is probably the major piece of the whole operation. In effect, it is just a bunch of nodes. These can be virtual machines.
Or they can be actual computers, hardware, where you can run your containerized applications or, in our case, our physics analyses. When a program is deployed on the Kubernetes cluster, it is automatically distributed to the individual nodes. Ideally, as maintainers, we don't have to worry about which workload is performed on which node. This is all scheduled by the Kubernetes cluster.
Unknown Speaker 8:02 Next slide.
Unknown Speaker 8:05 Another thing you'll hear about is the Kubernetes pod. You have learned about containers, so Docker containers should be familiar by now. A Kubernetes pod just represents a group of one or more containers which are running together on a node. We use pods to allow for easy communication between the containers, and the pod is also the component that allows load balancing. One thing Kubernetes clusters are great for is that they are scalable, and they allow for more resources if your cluster is under more load. Pods are a big component of this: replicas of the pods will be created if more load is put on your cluster, making sure that the cluster stays healthy and that your application keeps running, or your physics analysis isn't hindered by any additional load put on the cluster.
Unknown Speaker 9:10 Next slide.
Unknown Speaker 9:13 The next thing we mention is the Kubernetes job. A job creates one or more pods to perform a particular operation, and the job also tracks the overall progress. It updates the status, because, as mentioned, the cluster runs multiple pods at the same time, in parallel or in sequence.
And the job just tracks the execution of these pods to see which are active, which have succeeded, and which have failed, and ensures that the task which was set for the pods finishes. When the specified number of pods has successfully completed, the job is complete. Okay. To be able to interact with our Kubernetes cluster, we will use this command line tool, kubectl, which I think stands for "kube control". From our point of view, it's just the cockpit to control Kubernetes. Technically, it's a client for the Kubernetes API. All the operations on Kubernetes are exposed as API endpoints, and kubectl takes care of all these HTTP requests. What we're going to be using it for is typing it into the terminal to manipulate our cluster in different ways. Here are two commands. There's kubectl get pods, which is going to show us the pods which are currently in our cluster. And there's kubectl create -f task-to-be-executed.yaml, which is going to execute one of our YAML files, which you will also be introduced to soon. Okay, the next thing that will be mentioned here is Argo. Technically, we could do almost everything we want to on a cluster with the kubectl command. However, when things get more complex, and to make it easier for us so that we don't have to type command after command into the terminal, we will use this Argo tool, especially Argo Workflows. It's a collection of open source tools to help you get stuff done on your Kubernetes cluster. We will use Argo Workflows to execute some complex job orchestration, which includes both serial and parallel execution. And here you can see as well two commands, which we'll both be using
Unknown Speaker 11:48 in the future.
Unknown Speaker 11:52 Right, thank you. Okay. Now, Adelina already mentioned a bit about the cluster being able to scale and so on. And I think that's something that we should point out.
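The Job idea described above can be sketched in Python. This is a minimal sketch and not the workshop's own file: the job name `hello-job`, the `busybox` image, and the echoed message are illustrative assumptions. It builds a Job manifest as a plain dictionary and serializes it to JSON, a format that `kubectl create -f` accepts alongside YAML.

```python
import json


def make_job_manifest(name, image, command, completions=1, parallelism=1):
    """Return a Kubernetes batch/v1 Job manifest as a plain dict.

    completions: how many pods must finish successfully for the Job to be complete.
    parallelism: how many pods may run at the same time.
    """
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "completions": completions,
            "parallelism": parallelism,
            "template": {
                "spec": {
                    "containers": [
                        {"name": name, "image": image, "command": command}
                    ],
                    # Jobs require a restart policy of Never or OnFailure.
                    "restartPolicy": "Never",
                }
            },
        },
    }


# Hypothetical example values: three pods in total, at most two at a time.
manifest = make_job_manifest(
    name="hello-job",
    image="busybox",
    command=["echo", "hello from a pod"],
    completions=3,
    parallelism=2,
)

# Saving this text to a file would let `kubectl create -f <file>` submit it.
manifest_text = json.dumps(manifest, indent=2)
print(manifest_text)
```

Once such a file has been submitted, `kubectl get pods` would list the pods the Job created, which is where the active, succeeded, and failed states mentioned in the talk become visible.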
But that's also something where you have to be somewhat careful. In principle, as long as you have a credit card, you can scale up your cluster to a very large number of nodes, which effectively means virtual machines, and then you have practically unlimited computing resources. The cluster that we will be using now is one that uses autoscaling. We will start with something very small, where we can run a couple of tests. Then we will try to run a more involved workflow, which will require more resources than the ones we have currently provisioned in our cluster, and the cluster will scale up to a couple more virtual machines. Once these workloads are finished and we remove them, we can then scale down the cluster manually, or this will even happen
Unknown Speaker 12:59 automatically.
Unknown Speaker 13:01 And that's something that you need to be aware of when you're using this. It's really easy to get resources provisioned, but keep an eye on what you're using, and only ask for what you really need, because otherwise you'll just be paying. There are lots of pricing links here, for instance for the Google Cloud example, which is what we're going to be using. You don't have to worry about this for the current example, so for today. But you can, for instance, see how much you have to pay if you want a machine with four cores and 16 gigabytes of RAM. That's billed, I think, even by the minute. You can also see the price by hour and by month, so you get an idea of how much you pay. Usually, you will develop your code and test it locally, maybe even on your laptop, and then once you're happy, you can go and run the real thing in the cloud. That's something you can do very quickly, and you provision the resources.
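The "only ask for what you really need" point can be made concrete with a small Python sketch. The price used here is a made-up placeholder, not a Google Cloud quote, and the estimate deliberately ignores disks, network traffic, and any other fees. It only shows how the bill scales with the number of nodes and the time they stay up.

```python
def estimate_cost(nodes, hours, price_per_node_hour):
    """Rough cost in USD of keeping `nodes` machines running for `hours` hours."""
    if nodes < 0 or hours < 0 or price_per_node_hour < 0:
        raise ValueError("inputs must be non-negative")
    return nodes * hours * price_per_node_hour


# Hypothetical price in USD per node-hour; the real value is on the pricing page.
PLACEHOLDER_PRICE = 0.25

# A short test on the initial single node.
one_node_test = estimate_cost(nodes=1, hours=2, price_per_node_hour=PLACEHOLDER_PRICE)
# The same two hours after autoscaling up to four nodes.
four_node_run = estimate_cost(nodes=4, hours=2, price_per_node_hour=PLACEHOLDER_PRICE)
# A full day with the node pool scaled down to zero nodes.
scaled_to_zero = estimate_cost(nodes=0, hours=24, price_per_node_hour=PLACEHOLDER_PRICE)

print(f"1 node for 2 h:   {one_node_test:.2f} USD")
print(f"4 nodes for 2 h:  {four_node_run:.2f} USD")
print(f"0 nodes for 24 h: {scaled_to_zero:.2f} USD")
```

With a minimum node count of zero, an idle node pool costs nothing for compute in this simplified model, which is the reason for letting the autoscaler shrink the cluster again.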
And then once you're done, you download the output, delete everything, and stop paying immediately.
Unknown Speaker 14:19 Okay, another thing that we're going to point out: we have lots of computing power, and one has to be somewhat careful when one wants to run, say, a hundred jobs in parallel that all access the CERN Open Data server. This can be an issue. So we'll also tell you how you can get the data to where you are. The idea, and this is something we're going to show you, is that you download the data sets to your cluster first. Since storage isn't too expensive, you can keep them there for a bit longer, and then you can run over the data much faster. You won't be limited by input/output, but really just by the computing power. Okay, one important point before we get started. Lots of people already have personal or work Google accounts. To make sure that these don't interfere with what we're going to do, so that you don't accidentally get something into your account which should not be there, please make sure to use a private or incognito browser window in the following. That avoids the problem and makes sure you're just using this account, so we don't run into any technical issues. Okay, so that's it for the introduction. I think there were a lot of words and terms that were probably somewhat difficult to understand. I hope they will become clearer as we go through things. Now there's one more thing that we have to do, which is accepting a couple of terms of use to get started. We will all do that using the Google Cloud console that is linked from the slides here. It's console.cloud.google.com, and in the top right you will find a login button.
And, again, please use an incognito or private window. You log in, and then you will see a couple of windows, and I will guide you through them. If you get stuck, or you get something weird, let me know: speak up, unmute, and we can try to figure it out. Otherwise, we'll now go through the individual things that you will see. I'll try to do it slowly, but if something goes wrong, please interrupt me.
Unknown Speaker 17:11 Okay, so
Unknown Speaker 17:14 you will all have accounts named cms-g and then a number starting with zero and two other digits, at archive.com. That's an account that has certain privileges for this particular project. You will just have to accept the terms of service to get started. You can also put questions into the chat and into Mattermost. The "CMS analysis in the cloud" channel should now be there. I will also put the link into the chat, so that everyone can get started. Okay. Once you've accepted the terms of service, there'll be the usual Google account protection. We don't need any recovery phone number, since these are throwaway accounts, and no recovery email either. So you just click Confirm and continue.
Unknown Speaker 18:29 Okay, then you will get a screen that is going to ask you about your country. It doesn't really matter what you set there. You can choose United Kingdom, USA, wherever. Just tick that you agree to the terms of service, do not ask for email updates, and then click Agree and continue.
Unknown Speaker 18:57 Okay, once you've done that, you should see the Google Cloud Platform window. If after doing all this you've lost track of where you are in the browser, just go back to console.cloud.google.com.
You will then see a banner at the top telling you that you have a free trial waiting that gives you $300 of credit. Just dismiss this. You can use it later, once you want to do this for real for your own analysis. Once you put a credit card into the account, you will have three months and $300 worth of resources that you can use for your studies. The project, which is shown at the top left, should already be set to cern-cms. If not, click it and select it. You will then see a window popping up like this. You should in principle see cern-cms there, so just highlight cern-cms and then click Open in the bottom right. Okay, if that worked, you will now see the slightly more interesting overview panel, the dashboard. On the top left, you will see that cern-cms is selected, and you can see some metrics, resources, et cetera. We will all be using the same project, so everyone will be using cern-cms. You will already see, for instance, that there are resources provisioned. There are already virtual machines there, which some people are already using now, for instance.
Unknown Speaker 20:47 How do you get to the dashboard from cern-cms?
Unknown Speaker 20:52 You should just go to console.cloud.google.com after you've logged in with your account, and if you've done all these steps, you should see the dashboard. After selecting, actually, no, you get
Unknown Speaker 21:04 You get to a screen where you see Compute Engine, Cloud Storage,
Unknown Speaker 21:12 you can go to Home, and then Dashboard. All right.
Unknown Speaker 21:16 Yeah, thanks. So at the top left, click on Home, or simply click on Google Cloud Platform right next to cern-cms.
Unknown Speaker 21:31 Does that work?
Unknown Speaker 21:35 That's
Unknown Speaker 21:45 okay. I take that as a yes. Otherwise, no? Yes, yes. Okay, perfect.
Okay, so now we will move on to a different menu. On the left-hand side, in the menu, which you can also scroll, there's an entry called Kubernetes Engine. Click on that, and then click on Clusters. That will bring you to a page which will look a bit more interesting than the page I'm showing you here, because there will already be clusters available. We will now go through the steps to set up a cluster for yourself. You're not meant to use any of the existing clusters. You need to create one for yourself so that you learn how to create your own cluster, should you create your own account at a later point. Okay. There'll be a Create Cluster icon in the top panel, right next to Clusters. Now you have to be a bit patient. Do not click Create immediately, but wait until we've gone through the screens here to make sure you have the right setup. So the first thing is here.
Unknown Speaker 23:05 Clemens, can I interrupt you? Sorry. I wanted to ask you really quickly, because I was getting distracted trying to find it. Do we have a specific public Mattermost channel for today? I did not find it.
Unknown Speaker 23:28 Yeah, it's called "CMS analysis in the cloud". Maybe we have to put it into the
Unknown Speaker 23:33 into the Town Square, maybe the link, if you could. That would be helpful, I think.
Unknown Speaker 23:40 If not, remember you can hit the button which says More, and you should see the
Unknown Speaker 23:47 I think I just added the link to the chat in Zoom.
Unknown Speaker 23:51 I also added it to the Town Square. So you should see, yeah,
Unknown Speaker 23:53 Town Square, it should be at the bottom. Yeah. Because when I clicked that, you know, viewing all channels, it didn't.
I guess I was already in it without knowing. But, um,
Unknown Speaker 24:10 yeah. Okay. Okay. So, yeah.
Unknown Speaker 24:14 I'll also be monitoring. So you can also
Unknown Speaker 24:19 tell me to slow down there. Yeah.
Unknown Speaker 24:21 Yeah. Thank you. And then, because I was looking for that: from the dashboard, how did you get to Kubernetes? Yeah, it was down there. Yeah. It's in the slides anyway. Right.
Unknown Speaker 24:34 So this is Steve. I'm not finding an email that gave me the credentials to log into this. I didn't want to interrupt because I thought that everybody else had one. But
Unknown Speaker 24:47 okay, I sent it only to those who replied to the questionnaire, but I can
Unknown Speaker 24:54 send it to you.
Unknown Speaker 24:56 I mean, I haven't been responding to the questionnaire because it's noise because I'm baffled. data and data.
Unknown Speaker 25:01 Okay, so let's see, Steve, Steve Myrna, yeah. Okay, I'm going to send you a direct message with the credentials. Just give me a second.
Unknown Speaker 25:14 Okay, there you go.
Unknown Speaker 25:21 So I sent it to you on Mattermost. You should see it there.
Unknown Speaker 25:26 Okay, let me get back to my slides. So you have to click on Kubernetes Engine and Clusters. If you don't find that, you can also use the search box and just type Kubernetes. It will then suggest Kubernetes Engine or Clusters, and you will get there directly as well. Okay, now on to the cluster creation. As I said, you will already see a couple of clusters there, since we're all sharing the same project. If you were to do this on your own, you would not see any cluster there. That's why I added this screenshot, so that if you want to do this with your own account at a later stage, you see something that looks fairly familiar.
So now what we have to do is set a couple of things. As I said, please do not click Create before we're done, because there are a couple of things that we need to adjust in the settings. We're not going to use the standard settings. The first thing is that you will have to have a unique cluster name. As the cluster name, I propose you use the number that is assigned to your login, that is cms-g, zero something, at archive.com. Just take the three digits from there and append them to the cluster name. In the example that I've given here, the cluster name would be cluster-010. Don't change anything else, and do not click Create yet either. Once you've done that, click on the second item in the left-hand menu, which is the default-pool entry under the Node Pools category. Here is what I mentioned during the introduction: we're going to use autoscaling. So, the size of the cluster, the number of nodes... let me use the
Unknown Speaker 27:32 annotator here.
Unknown Speaker 27:37 Let me draw. The number of nodes here should be set to one. Then you should tick this box to enable autoscaling, set the minimum number of nodes to zero, and set the maximum number of nodes to four, which should be good enough for what we're doing. And, again, do not click Create yet. Wait, we have to do one more thing.
Unknown Speaker 28:09 Ah, I have to quit.
Unknown Speaker 28:12 Quit this. Okay, now we go down the menu a bit further. Under the default-pool menu entry there is a Nodes entry. And I actually have to delete the annotation you all saw. We're going to use a slightly different machine type. The default machine type has just two cores and four gigabytes of memory, and these would be shared. This is too little for our workloads. A typical CMSSW workload requires one full CPU and two gigabytes of RAM.
But since there's a small overhead when you're using Kubernetes, because there are other services running on the node, you do not have the full CPU available for yourself. So we're just going to use a slightly bigger machine. We're going to choose a machine of type e2-standard-4, which should give you four virtual CPUs and 16 gigabytes of memory. That is in the E2 series, which should already be selected. Then just choose e2-standard-4. Once you've done that, there's no need to change the persistent disk or the boot disk size. Just leave everything as it is, and then click on Create.
Unknown Speaker 29:41 Okay, so once you've done that,
Unknown Speaker 29:45 creating the cluster will take some time. Basically, there's someone walking around the computer center trying to find a computer that they can give to you, somewhere in the central US, because that's the zone we selected. You will see a spinning wheel right next to the cluster name that you chose. You have to wait until this cluster is provisioned. It will be a few minutes, but not more. While the cluster is being created, we can already run a command to log in to the system.
Unknown Speaker 30:25 There's a so-called Cloud Shell icon in the top right. This is the terminal icon that is highlighted on this slide. Once you click that,
Unknown Speaker 30:39 you should see
Unknown Speaker 30:42 a terminal pop up at the bottom. This is the Cloud Shell terminal. It will tell you something like "Welcome to Cloud Shell! Type help to get started." And it's going to remind you which project you're in, which is cern-cms. You don't need to change to a different project, because that's the only project we have. But one thing that we need to do is log in to this project again via the terminal, this Cloud Shell.
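The cluster settings chosen by clicking through the console (one initial node, autoscaling between zero and four nodes, machine type e2-standard-4) have a command-line counterpart. The following Python sketch only assembles and prints such a `gcloud container clusters create` command without running it. The cluster name `cluster-010` is the speaker's example, and the zone `us-central1-c` is an assumption based on the central US zone mentioned in the talk.

```python
import shlex


def create_cluster_command(
    cluster_name,
    zone="us-central1-c",        # assumed zone, see the note above
    machine_type="e2-standard-4",
    num_nodes=1,
    min_nodes=0,
    max_nodes=4,
):
    """Return the gcloud argument list that mirrors the console settings."""
    return [
        "gcloud", "container", "clusters", "create", cluster_name,
        "--zone", zone,
        "--machine-type", machine_type,
        "--num-nodes", str(num_nodes),
        "--enable-autoscaling",
        "--min-nodes", str(min_nodes),
        "--max-nodes", str(max_nodes),
    ]


command = create_cluster_command("cluster-010")
# Print a copy-pasteable command line; nothing is executed here.
print(" ".join(shlex.quote(part) for part in command))
```

Building the command as a list rather than a single string keeps each argument separate, so it could later be passed to `subprocess.run` unchanged if one wanted to script the cluster creation.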
That's all in the browser, just to be able to interact with the cluster via the command line as well, and not just by clicking around, because we will have to execute a couple of commands on the command line. If you now enter gcloud auth login, as written here at the bottom, you will get some output asking you to go to a link in your browser, a long https://accounts.google.com link. So right-click this link, for instance, and open it in a new tab. Then basically just click Continue, Continue, Accept, or something like that. At the very end, you will get a verification code. Copy that verification code, go back to this tab, paste it, and hit Return. That should log you in. This is something that we only have to do once, and from then on things will be much easier.
Unknown Speaker 32:22 So I'm going to give you a minute to do that.
Unknown Speaker 32:51 Okay, so I'm going to slowly continue, assuming that you've managed to do that.
Unknown Speaker 32:57 There is a question. I keep seeing Philip Engler has a hand raised.
Unknown Speaker 33:02 Yeah, hi, sorry. I'm still at the stage where I should choose the machine type. Somehow I can't choose the e2-standard one. I only have e2-micro, e2-small, and e2-medium as options.
Unknown Speaker 33:21 Which are all two CPUs only. You can scroll down further.
Unknown Speaker 33:27 I can, sorry. I'm so sorry. Yeah, I didn't see that. Okay. Thanks. No worries. Okay, good.
Unknown Speaker 33:45 Okay, yeah, so basically, at the point that you click Create, and then, let's say, two to three minutes later when the cluster is created, the clock starts ticking, which means that now, in principle, you would start paying for this machine. But this would just be a couple of cents to get started, because it's a few minutes, so it's nothing to worry about. And we started with a really small cluster.
Once your cluster has been created, you can see how big your cluster is and how many virtual CPUs you have. For instance, in the example that I have here on slide 26, you can see the location, us-central1-c, which is one of the cheapest locations, which is perfectly fine. Sometimes you want to be in a location that is, for instance, closer to the data. If you didn't want to download data from the CERN Open Data portal but wanted to stream the data directly into your cluster, you could do that, and you would then probably want to choose a central European location, because the latency will be smaller. Okay, so the cluster size should be one, the total cores should be four virtual CPUs, and the total memory should be 16 gigabytes. Then you go on and click on the Connect button. That should be the Connect button for the cluster that you created, so make sure it's the cluster- with the number that you chose. Once you do that, you'll get a pop-up window, which will tell you that you can connect using the dashboard or the command line. We're going to use the command line. What you should do is copy the command that is given there. You don't have to remember it, just copy it and click Run in Cloud Shell. Most likely, the command will even be copied directly into the terminal, so all you will have to do is hit Return. Just before you hit Return, check once more that the command has the cluster number that you created. It should be something like gcloud container clusters get-credentials cluster- then the number, then --zone us-central1-c and --project cern-cms. The only thing you have to be careful about is the cluster number. Okay, and we'll be getting into these numbers.
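The check the speaker asks for, that the connect command carries your own cluster number, can be sketched in Python. The login `cms-g010@example.com` is a hypothetical stand-in for a workshop account, and the zone and project values are assumptions based on what is said in the talk. The sketch derives the cluster name from the three digits in the login and prints the matching `gcloud container clusters get-credentials` command without running it.

```python
import re
import shlex


def cluster_name_from_login(login):
    """Build 'cluster-NNN' from the three digits right before the '@' of a login."""
    match = re.search(r"(\d{3})@", login)
    if match is None:
        raise ValueError(f"no three-digit number found in login: {login!r}")
    return f"cluster-{match.group(1)}"


def get_credentials_command(cluster_name, zone, project):
    """Return the gcloud argument list that points kubectl at one cluster."""
    return [
        "gcloud", "container", "clusters", "get-credentials", cluster_name,
        "--zone", zone,
        "--project", project,
    ]


# Hypothetical login; zone and project are assumptions taken from the talk.
name = cluster_name_from_login("cms-g010@example.com")
connect = get_credentials_command(name, zone="us-central1-c", project="cern-cms")
print(" ".join(shlex.quote(part) for part in connect))
```

Comparing the printed line with the one the Connect pop-up offers is a quick way to spot a mismatched cluster number before hitting Return.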
And you will really remember this number once we are done with this tutorial, because we have to use it in lots of places, since we're sharing this cluster,
Unknown Speaker 36:41 Sorry, Clemens, I am a little behind. How did you get here? I already have my terminal, but then you clicked on the cluster that we created?
Unknown Speaker 36:54 Yes. Sorry, maybe I was too fast. So you managed to log in, right? Yes, yes. Okay.
Unknown Speaker 37:00 So then I think that I got disconnected for some reason, and I was trying to see how to reconnect, and there's actually a button to reconnect. I was dealing with that, and then I missed how you... so my shell is already on?
Unknown Speaker 37:15 Yeah. Close the shell again, just click on the X button in the top right. Then, on the cluster overview page, find your cluster in the list, or you can also filter for it, and click on Connect. Once you do that, you should see this pop-up window, Connect to the cluster. If you see that, just click Run in Cloud Shell, which will again open the Cloud Shell but also directly paste this connect command into the terminal. The idea behind this command is that it sets up which cluster we're talking to when we run the following commands. If you open the Cloud Shell without running this command and you then run, for instance, kubectl, this Kubernetes control command, it won't know which cluster to talk to. By running this command, it'll set an environment variable so that it can actually find the cluster to talk to.
Unknown Speaker 38:30 I see.
Unknown Speaker 38:32 I see a long list of clusters, but how do I know which is mine?
Unknown Speaker 38:38 When you created the cluster at the very beginning, you set a cluster name.
My suggestion was that you use cluster-010 in case your email address, so your credentials, were cms-g010@archive.com.
Unknown Speaker 38:59 Sorry, Clemens. Well, in my case it got, like, stuck, so I don't know if I can reload the page. What should I do? It's not doing anything. It's been stuck at something like 63% for the last five or
Unknown Speaker 39:23 So you can refresh the page, that doesn't hurt.
Unknown Speaker 39:27 Navigate back to the Kubernetes cluster overview and then connect to the cluster again. So, the other person, did you figure out your cluster number?
Unknown Speaker 39:52 If you did not set the cluster number explicitly in this step here, where you actually set the name of the cluster, you will just have a running number. Your cluster number might just be two or three, or something like that, so it won't be a three-digit number. In case you've done that, you can connect to that cluster. And if there is another person who has also forgotten to set the cluster number explicitly, you could maybe put it into Mattermost, or even the Zoom chat, just to make sure that we don't have two people working on the same cluster, which might be, let's say, somewhat confusing.
Transcribed by https://otter.ai