Unknown Speaker 0:00 Later. Unknown Speaker 0:02 Oh, good, good. I just saw that the recording has clicked back on. Unknown Speaker 0:09 Are you guys good with me jumping in? Unknown Speaker 0:13 Okay, I'm starting. Unknown Speaker 0:16 Alright, so, uh, let me reiterate what's been said before. Thank you to all of the organizers and facilitators who have worked so hard for the last months, and even going back a year, year and a half, when we first started really talking about the details of this. And thank you to all the participants who are taking part in this workshop. You are scattered across the world, you have been working on the pre-exercises, and you have been gracious and patient with us as we have been working on everything. So thank you very much for that. I've got a few slides that I'll show you to lead into the exercise. But I'll let you know that talking into a quiet computer screen can be a little bit disconcerting at times, and that's fine. But can we just take one moment, because I would love to hear everybody for a second. Can everybody, just for like five seconds, unmute and just make a mighty noise, so I can hear that people are out there? Let me hear. Hello! Unknown Speaker 1:36 Ah, ciao, ciao. Unknown Speaker 1:41 Ni hao. Guten Tag. Guten Morgen. Unknown Speaker 1:49 is great. Unknown Speaker 1:51 Um, yes, thank you everybody for that. That gives me a sense that everybody's out there. So let me jump into the dataset scouting. I'm going to take a moment to share some slides and find them. I have too many windows open, and so finding the slides is always... Unknown Speaker 2:18 Sorry, I apologize for this. Unknown Speaker 2:22 Let's see. Is that it? Oh. Unknown Speaker 2:28 Ah, here we go. Let's see if this works. Unknown Speaker 2:33 So can someone let me know if they see the shared slides with data scouting? Is it showing you the whole screen or just part of it? Perfect.
It's perfect. Unknown Speaker 2:45 It looks like it shows the whole screen. Unknown Speaker 2:48 Yeah, perfect, perfect. All right. So let me talk a little bit about data scouting in just a moment. First, I want to remind everybody, or maybe not remind, but just tell people about a few things. So there was a posting earlier, and people have mentioned this: fill out these Google Slides to introduce yourself to everybody. They've been posted in a few places; we would love to meet everyone. It's been mentioned by other speakers: sometimes the best part of a workshop or conference is just talking to people, having a coffee together. We don't have that, but we would still like to meet everyone, so fill it out. It's been interesting for me to even see some of the backgrounds of some of the other facilitators, and it's the best we can do under these circumstances. The other thing is a little bit newer: if you go to Mattermost and you go to the town hall, there is a link there to a Google Doc. The organizers and the facilitators would like to improve the Open Data effort, the Open Data workshop experience, and so we'd like to keep a running set of notes of things that we want to go back to. Sometimes it's difficult to really curate this from Mattermost, so if there's a particular suggestion or a particular problem that people had, ideally someone, maybe even a facilitator, can just take a moment, put a link to the Mattermost thread or summarize it, and then we can go back to it. So again, you don't have to do it now, but if you can, add an introduction to the Google Slides, and if you get a moment, open up this Google Doc. For quick questions, Mattermost is still the place to go, but if we find out that 20 people have had the same question, we want to make a note in the Google Doc so we can come back to it. Okay, let me jump into the exercise proper.
So, this exercise is going to try to help you walk through finding the data. By data scouting, we mean, to first order, looking for the data: What's in it? Where do I find it? What data is even available to me? I realize that the majority of the participants are not experimentalists, and so there's a lot of jargon that, as experimentalists, we get used to. I'm going to try to explain what some of that is, and this is, in the data scouting lesson, if you're following along with it, part of what the introduction covers. So you're going to hear me and other facilitators refer to collision data. When we say collision data, we're talking about data from the actual machine, from the CMS experiment. Oftentimes we don't even say collision data, we just say data, and for the most part, depending on the context, we know what we mean. Now, Unknown Speaker 5:58 if you're looking for data to run your analysis on, and you've got this brilliant idea you want to test out, you might say, okay, what year of data am I going to look for? Because the data are labeled by year; the dataset's year tells you how much data was taken and what beam energy was used. The data that's publicly available, and the data that you'll be working with right now, is at 7 TeV and 8 TeV center of mass. Depending on what your analysis is, you may want both, or you may want one or the other. And then in red here, and in italics for those of you that might have difficulty distinguishing colors, I wanted to point out something that's more unique to the collision data, and that's that the datasets are broken up by triggers. Now, there's going to be a whole lesson about triggers, but what we're referring to are selection criteria on individual events that take place virtually at the time of running; this may be something in hardware, this may be a very, very quick software decision that's made.
But we decide what data to write out based on these triggers. And depending on the physics analysis you're interested in, you may want some subset that has always identified a high-momentum muon, or has always identified two high-energy photons, or has always identified a large amount of energy deposited in the calorimeters; it depends on your analysis. But when you're looking for real data, the datasets will be broken down by triggers. Now, we also have Monte Carlo data. This can be confusing, because we sometimes use the word data to refer to both of these, but we try to be careful to say: there are also datasets where we have Monte Carlo, and this is simulation data. So you have some physics process that you may be simulating, maybe you're using MadGraph, maybe you're using some other tool, and then you want to see how CMS would reconstruct that. And so again, we have a bunch of simulated data; it's broken down by the year, which tells you the beam energy, and then we try to generate enough Monte Carlo that you can do a consistent analysis based on how much real collision data we took. The thing that's different in these datasets is that they will be labeled by what physics processes were simulated. So we have the ability to simulate the production of a Higgs, a Higgs decaying to something, maybe something exotic, top quarks, a whole bunch of other different physics processes. And we'll go through and see what some of these physics processes are. Okay. Now, how do you find this data? How do you find what's even available? Well, everything is there on the portal. I'm going to explain why we're using the portal to find this in just a second, but I just want to emphasize: it is all here. And what I'm going to do, probably for the next 30 to 40 minutes, is walk you through the search options of how to find this data. It is not always immediately obvious, especially to first-time visitors.
And I'll be honest with you, I learned a lot in producing this lesson for you. But it is all there; we're going to go and find it. I'm going to try to explain to you why we're using some of these search features of the portal. But before we jump into that, I'm going to stop sharing for a second, and, ideally with people raising their hand: does anybody have any questions to this point? I know it's not a lot yet Unknown Speaker 9:43 but Unknown Speaker 9:49 Okay, then, I will just take a moment now and swap for a second. Unknown Speaker 10:06 So, Unknown Speaker 10:08 once again, I will just ask if people can see that I have a browser page open with the dataset scouting lesson on it. Unknown Speaker 10:18 Anybody can let me know if it's visible, I Unknown Speaker 10:22 can just jump in. Cool. Unknown Speaker 10:25 Excellent, thank you. Unknown Speaker 10:28 So Unknown Speaker 10:29 there are four modules. The first one, I'm just going to give you a... it summarizes what I told you, but I'm going to comment on a few other things. Then we're going to walk through where the datasets are, and at that point I will leave the lesson and actually bring up the portal and walk through it with you. We'll talk about what data and Monte Carlo are available. And at the end, we will actually look in one of these data files, using commands that you can run either in Docker or the virtual machine. Okay. But there are a few things here in the introduction that I did not have on the slides, and I want to comment on them. Unknown Speaker 11:08 So Unknown Speaker 11:12 as experimentalists, we are very used to the fact that sometimes doing things can be challenging; we convince ourselves that this is the right way to do stuff, the right way to find stuff. Instead of eating like this, we eat like this. And there's usually a reason for that; we're not just making things arbitrarily difficult for you.
So let me explain why we want to make sure that everything goes through the portal. The data that you will be using has been vetted and calibrated and blessed by a variety of people throughout the stages of analysis, both when it first came off of the machine and off of the detector, but also when it was being prepared for open data and open use by you. Like Kati said, this effort has been in the works for a long time; I've been involved with these discussions about open data since 2009, 2010. And one of the concerns is that people will use the data improperly. That is, they will use the data without fully understanding it, or use poorly calibrated data, and then they might find something that's not really there. So when we release the data, we want to make sure that people are using the right thing, and that we know exactly what they're using, so that if there are questions, we can go back to it. So all the datasets that you're going to wind up working with have a DOI associated with them. This is a digital object identifier. Unknown Speaker 12:49 And Unknown Speaker 12:51 it allows us to backtrack through the entire process of calibrations and processing, so that if you come to us and you say, oh, I found new physics in this dataset, we can say, well, look, exactly what record were you using? If you then go to publish it, you can use this DOI; it's a way to cite the data. And if you go to the lesson plan, I've just done a quick screenshot of one of our datasets, and I've asked you to try to find the DOI, but you'll find it listed right here. Assigning a digital object identifier is not unique to CMS or to these datasets; it's used for pieces of music that are on the web, blog posts, or software; whole software suites have a DOI associated with them. And we need this to be able to keep track of everything, as well as to be able to give credit to those people who worked on producing this data for everyone.
And then the other term that you will hear people use is provenance. If you're not as familiar with this word, it refers to being able to find the history, the lineage of something. It is not specific to data; it applies to any general object or thing or concept whose history you're trying to trace. And so you will sometimes hear us refer to the provenance of the data: being able to track through when it came off the machine, how it was processed, what the software release was, what the calibrations database was, and so on. And then we can say, oh yes, this was done properly, or not. And so it's for these reasons that we're going to make sure that we do everything through the portal. Okay. Unknown Speaker 14:39 All right. Unknown Speaker 14:40 So at this point, I am now going to go to the second module of the lesson: where are the datasets? And I'm going to actually walk through it, like, I'm going to see if I can do this in real time, if I can follow these steps, and then I encourage you to follow these steps as well. So at this point, I'm going to stop sharing. I am going to keep my own lesson cheat sheet on the side, and then I am going to actually bring up the portal. Unknown Speaker 15:22 Okay. Unknown Speaker 15:24 Okay, so I apologize for this, but I will ask again: can someone let me know if they see that I have the open data portal up? Unknown Speaker 15:35 Yes, thank Unknown Speaker 15:36 Thank you, Edgar. And I will give everybody 30 seconds or so if they want to bring it up as well. This is opendata.cern.ch. If you just Google CERN open data, you can come to the portal as well. And hopefully everybody will. If you haven't looked through it before, that's fine; we're going to take a stroll through the datasets. And Edgar, would you like me to make sure I finish this by 10:30, so that we have a little bit of a buffer? Unknown Speaker 16:14 Yeah, sure. Okay. Thanks.
Unknown Speaker 16:20 Alright, so hopefully everybody has this open. You can go here and just start typing: you know, Docker; you could type Higgs if you wanted; you could type, you know, muon. And that's one way to go and start finding stuff. But if you're not sure exactly what you're looking for, it can be confusing. So let's start off by going to CMS, which is right here, and let's just see what happens when we click on CMS. It may take a few seconds to come up, but right off the bat you come to this landing page where you find a whole bunch of, these are individual records in the terminology of the open portal. These records may just tell you stuff about CMS, maybe documentation about getting started, and then eventually, I don't know if there's anything on here, but there are datasets. So right off the bat you say, oh, muons and electrons in PAT candidate format derived from such and such. You may not know what that means at first, but that's okay. When it first comes up, there is this sidebar here, and we're going to be using the sidebar to track down some datasets; we're going to see what data there is. To first order you might see things like "include on-demand datasets"; don't click on that. We will, in a little bit, filter by dataset. There are all these numbers here that tell you how many records are associated with these tags, tags like "dataset". The first thing I'm going to do is collapse these, just because it's a little bit confusing at times. As you scroll down, you can see filtering by the experiment. So the only thing that's selected right now is CMS, and I would just like to point out that there are 3900 records from CMS; the other experiments are, I will kindly say, catching up to 3900. You can filter by year: 2010, 2011, 2012, this is what we're working with. You can see that there is data, eventually, that we are starting to think about: 2016, 2018, and 2019.
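The sidebar filtering described above ultimately just builds up query parameters on the portal's search URL, so the same CMS + 2012 + dataset selection can be written down explicitly. A minimal sketch; the `f=facet:value` parameter names here are assumptions read off a browser address bar, not a documented API contract.

```python
from urllib.parse import urlencode

# Hypothetical sketch: build a search URL corresponding to selecting the
# CMS experiment, year 2012, and the "Dataset" tag in the portal sidebar.
# The facet names (experiment, year, type) are assumptions.
params = urlencode([
    ("q", ""),                  # empty free-text query
    ("f", "experiment:CMS"),
    ("f", "year:2012"),
    ("f", "type:Dataset"),
])
url = "https://opendata.cern.ch/search?" + params
print(url)
```

Pasting a URL like this into a browser is a quick way to bookmark or share a particular filter combination instead of re-clicking the sidebar each time.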
You can filter by file type; some of these you may not understand yet, we will get there. Collision type, and you can filter by collision energy, and you see that most of the information is about the 7 TeV and 8 TeV. And then you can filter by category and signature; these are now physics processes and keywords. But, like I said, I'm going to follow the lessons. And so one of the first things that we're going to do is, let me, I have to bring up the lesson from my cheat sheet. So what we're going to do is look for datasets for 2012. Unknown Speaker 19:20 So again, I have looked for, I don't know why I'm getting an invalid query. Unknown Speaker 19:26 Let me unselect some of these: Unknown Speaker 19:30 validation, Unknown Speaker 19:32 framework, software. Unknown Speaker 19:37 Okay. Unknown Speaker 19:39 So the only things that I ideally have selected right now are dataset, CMS. Unknown Speaker 19:48 And then I'm going to select 2012. Unknown Speaker 19:54 I'm going to re-collapse these so I can see everything. So I've selected dataset, I have CMS selected, and I have 2012 selected. Now, as I mentioned, collision refers to real data from the machine. This is real data that might have new physics in it, just waiting to be discovered. There are also derived datasets; we're not going to talk as much about them today, but they are data that has been reduced in some format, maybe for event displays, visualization, outreach efforts. And then there's simulated data, and this is the Monte Carlo simulated data. You can see already over here there's information that might be more of what you're looking for, tracker hit information in ROOT files; you can see that this is a derived dataset. So we're going to work with just data right now: I'm going to unselect simulated, and I'm going to unselect derived, and so the only thing that I want to look at right now is the collision datasets.
Now, with these collision datasets, and there's a whole bunch of them here, right now I'm only showing 20 results, but you can choose to show 50 or 100 or more. I'm just going to keep 20 right now because it loads a little quicker. All of these datasets have three fields that are separated by slashes: one field, two fields, three fields. This first field tells you what trigger it is. And we are working to produce a better, human-readable and understandable list of these triggers. But for now, if you see something like DoubleElectron, you should get some sense that, oh, there was a requirement that when the data were taken, these data were only written when there were two electrons that passed some threshold. Now there's going to be a whole lesson about triggers and about finding information about triggers; I believe that might even be right after this, so you'll get more information about this. MuOniaParked, NoBPTX, I actually don't even know exactly what that is. PhotonHad, that required a certain higher-energy photon as well as some deposition of energy in the calorimeters. And so that's the first field, and this is specific to collision data. The field after that tells us that this was from Run2012B; the v1 tells us something about the processing. It is not surprising that it was 2012, because we selected 2012 data. Now, within that year, the data was probably broken up into A, B, C, D, different run periods. Sometimes those run periods are very short; there may have been problems with the machine. You may not get every letter of the alphabet, because we may have decided this data is not appropriate, either for analysis or for the open data, because of issues or extra corrections that need to be done. So for instance, down here, Run2012B-22Jan2013 tells us something about when it was processed. And this last field tells us about the data format. This RAW field right here says that, look, this is data from the machine.
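Those three slash-separated fields can be pulled apart mechanically. A small sketch, assuming only the trigger / processing / format pattern just described; the example name is stitched together from the DoubleElectron and Run2012B-22Jan2013 pieces mentioned above.

```python
# Sketch: split a collision dataset name into its three fields:
# trigger (primary dataset), processing string, and data format.
def parse_collision_dataset(name):
    fields = name.strip("/").split("/")
    if len(fields) != 3:
        raise ValueError("expected /Trigger/Processing/Format, got %r" % name)
    return {"trigger": fields[0],
            "processing": fields[1],
            "format": fields[2]}

info = parse_collision_dataset("/DoubleElectron/Run2012B-22Jan2013-v1/AOD")
print(info)
# {'trigger': 'DoubleElectron', 'processing': 'Run2012B-22Jan2013-v1', 'format': 'AOD'}
```

This is handy when scripting over many dataset names, for instance to keep only the AOD-format entries for a given trigger.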
This is what we would refer to colloquially as hit information: information about the TDCs, the ADCs, really, the information about the signals coming from the detector. And depending on what you're doing for your analysis, most of you will not be interested in this information. Most of you will be more interested in our datasets that have AOD; AOD stands for analysis object data. These are files where you have four-vector information and information about where energy was deposited in the detector. The majority of you will work with AOD data, and in fact, this whole workshop is structured around the AOD files. These are real data files. Now, if you click on any of them, let me see if I can find an AOD one, I'm going to click on this one. You can find more information on any one of these. You can see information about the size of the dataset: this is, let me see, 17 million events, spread out over 1549 files; that's 5.4 terabytes of data. So it is a significant dataset. There's a DOI up here, so if you wanted to cite this dataset, you can reference that. Unknown Speaker 24:47 There's a bit more information about how these data were selected, and now you find something human-readable once you dig down into the datasets: events stored in this primary dataset were selected because of the presence of at least one photon and high missing transverse momentum from jets, or one photon, and so on and so on. I will let you read that if you want. There is information about the HLT trigger paths; these are the specifics of the trigger, and again, you will learn more about this in a subsequent exercise. And then there's information about how they were validated, and links to how you can use these data. Now, if you wanted to just download all these files to your desktop, it's not as easy as quickly clicking one of these downloads, because, oh, this is only 85 kilobytes, this can't be the full data.
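The size numbers on the record page give a quick feel for the scale before you touch anything. A back-of-envelope sketch using the figures quoted above (17 million events, 1549 files, 5.4 TB); the per-file and per-event averages are rough, since real files vary in size.

```python
# Rough averages for the AOD dataset quoted above.
events = 17_000_000
n_files = 1549
total_bytes = 5.4e12   # 5.4 TB

events_per_file = events / n_files           # roughly 11,000 events per file
gb_per_file = total_bytes / n_files / 1e9    # roughly 3.5 GB per file
kb_per_event = total_bytes / events / 1e3    # roughly 320 kB per event
print(round(events_per_file), round(gb_per_file, 1), round(kb_per_event))
```

Numbers like these make it clear why you skim remotely rather than download: even a single average file is a multi-gigabyte transfer.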
But if you click on any of these, let me see if this will open up for us: these text files are locations for where these data are stored. And you see that there is an incredible number of files here. They're all ROOT files, which hopefully you have become somewhat familiar with. You can see PhotonHad in the name, AOD in the name. And again, all of these individual files have some subset of those, what was it, 17 million events or so in them, these 1500 files. But this is the way, if you wanted to just, let's say, examine one of these files, inspect it, see what's in it, this is where you would go; this is where you would go to find one of these ROOT files to take a look at. Okay. Unknown Speaker 26:27 All right. Unknown Speaker 26:28 So let me now go back. So I'm back here at our main search, if you will. We have CMS selected, we have 2012, we have collision selected. I'm going to unselect collision, and I'm going to click simulated, and now we're going to look and see what Monte Carlo there is. Unknown Speaker 26:59 So Unknown Speaker 27:04 there we go; just collapse those. I have dataset, I have CMS, I have 2012, and in the dataset I have the simulated data. And now the naming scheme is slightly different. There are still three fields, and the three fields are separated by these slashes, but the first field is no longer the trigger; the first field is the physics process. What you'll learn in the next lesson is that you can take any of these physics processes and apply a trigger simulation to it. In other words, you could see how much of this physical process would have survived the selection criteria imposed by this trigger, or this trigger, or this trigger, and this way you can find out which trigger might give you the most sensitivity.
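If you do grab one of those index text files, picking out a single file to inspect is just line filtering. A sketch with a tiny inline stand-in for a downloaded index; the shortened paths below are illustrative, not real file names.

```python
# The index files list one file location per line; here a small inline
# string stands in for a downloaded index, with illustrative paths.
index_text = """\
root://eospublic.cern.ch//eos/opendata/cms/Run2012B/PhotonHad/AOD/0000/example_1.root
root://eospublic.cern.ch//eos/opendata/cms/Run2012B/PhotonHad/AOD/0000/example_2.root
"""

root_files = [line for line in index_text.splitlines()
              if line.endswith(".root")]
first = root_files[0]   # a single file you could hand to an inspection tool
print(len(root_files), first)
```

In practice you would read the real index with `open(...)` and pass one entry to whatever inspection command you are using, rather than copying all 1500 files.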
So this one right here says DYJetsToLL_M-50_TuneZ2Star_8TeV-madgraph-tarball. As an experimentalist, I am sometimes embarrassed by, like, showing you how we're doing all this data storage; we get so used to it, learning how to parse these, and I don't always know how to parse them all. But let's see if we can figure out what this is. So this is DYJetsToLL: this is a Drell-Yan process, probably with some additional jets in the process. Drell-Yan is quark anti-quark to a virtual Z or photon that then decays to leptons, and so that's the two leptons. M-50, I would assume, looks at the mass of the dileptonic parent and says, let me require that this is greater than 50 GeV. And then there's information about what was used to produce this; it looks like it was MadGraph that generated the data. In some of these datasets you'll see more information, information about what version of Pythia maybe was used. So that's the first field. The second field is what we refer to as a global tag. So this says this was processed Summer12, no pileup, information like that. You are not as concerned about this global tag; as experimentalists, we may see these global tags change over the course of a year as we go through reprocessing, but for you, we have given you a finalized, blessed version of the data. And then the last field is the data format, and you'll see that this is not AOD, it's AODSIM. This is designed to distinguish between AOD files, which are from the detector collision data, and simulated Monte Carlo. You can analyze them the same; 90% of the format is exactly the same, but you may see additional information in the simulated files. For instance, we tend to save information about the original four-vector, parton-level information that was used to generate these data. So again, you can click on one of these, and I'm going to go into this one.
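The pieces just decoded from DYJetsToLL_M-50_TuneZ2Star_8TeV-madgraph-tarball can be teased apart with a small pattern. A sketch inferred from this single example, not a general parser for CMS simulated-dataset names, which vary quite a bit.

```python
import re

# Sketch: pull process, mass cut, tune, beam energy, and generator out
# of the first field of a simulated dataset name. The pattern is
# inferred from the one example discussed here.
name = "DYJetsToLL_M-50_TuneZ2Star_8TeV-madgraph-tarball"
m = re.match(
    r"(?P<process>[^_]+)_M-(?P<mass_cut_gev>\d+)_Tune(?P<tune>\w+?)_"
    r"(?P<energy>\d+TeV)-(?P<generator>.+)",
    name,
)
info = m.groupdict()
print(info)
```

Running this prints process DYJetsToLL, mass cut 50 (GeV), tune Z2Star, energy 8TeV, generator madgraph-tarball, exactly the reading given above.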
So this is our Drell-Yan plus jets to dileptons. You can sometimes find information on this; you can see the description of the simulated dataset names, and about the datasets you can find more information here. It was basically what I went through, and these should give you some more Unknown Speaker 30:33 more information about how to interpret these. Unknown Speaker 30:36 This is 1.1 terabytes, 348 files, 4 million events for this process. And if you go down to the very bottom, you will find that, again, you can find a list of all the files that make it up. So most of the information on these pages is the same whether it is data or Monte Carlo. This is a great way for you to start to explore the datasets and figure out what's out there. But most of you will not really download all these data; you're not going to come to one of these text files, download the text file, then run a command to copy everything to your machines. That's not generally how you do it. What you will do is use some tools that you'll be learning over the course of the next few days; those tools will actually be making use of things like these text files. And so we've set them up, we know how to access them, and you will learn how you can access them. Then, whether it's in a virtual machine or in Docker, you can run over all of these and produce some skimmed amounts of the data that you will then analyze, maybe in the Docker or virtual machine environment, or at that point you will take it out of that environment and put it onto your local cluster. But what I'm showing you is a bit behind the scenes of how this works, and how you can find all these things. So that's how to walk through the portal. I will mention one other thing: if you want to start cutting down even more, look at the sidebar and you'll find these "filter by category" options.
And you can find information here about what physics has been simulated: Higgs physics, standard model, the Drell-Yan that I mentioned; minimum bias is no trigger, I believe. And then if you go to top physics, which is what I work on, so I'm just going to click on it, you can find information: t gamma, t tbar. Some of these are ttbar plus jets. And again, you can find more information on the portal about this, so just poke around with this. Before we actually go look at any of these data, I'm going to stop for a moment and see if there is... oh, David, thank you. I appreciate that there have been some comments in the Zoom chat specifying what some of these fields are. Yeah, I do not know all of these, so thank you, David, for chiming in. This would be an excellent time for me to emphasize that if you have questions about anything that you're finding on the open data portal, Mattermost is the way to go. As Kati said, we want to create a community where people can learn from each other. So even after this week, if you're trying to understand stuff on the portal, you will have access to Mattermost, emails of people, and we can create a community that can work together, help each other, and learn from this. So again, before I jump into taking a look at one of these files, does anybody have any questions? If you do, I would ask that you raise your hand. It's very possible that while I was talking, everybody was asking stuff on Mattermost and getting your questions answered, so there might not be anything. Uh, yes, and I apologize if I mispronounce this, but if you can unmute and ask your question; and if you could tell me how to pronounce your name first, before you ask your question. Unknown Speaker 34:16 [inaudible] Unknown Speaker 34:19 Thank you. Unknown Speaker 34:21 It's a simple question.
So, to Unknown Speaker 34:25 produce Unknown Speaker 34:26 an AOD simulated sample with either full simulation or some kind of fast simulation: is this a possibility for us? Is it open source? Or do we have to use something like Delphes? Can we generate this kind of format? Unknown Speaker 34:49 Yes. So the short answer is yes. Edgar, it sounded like you were going to jump in; I will let you jump in after I just say yes, but go ahead. Unknown Speaker 34:58 That was it. A big yes. Unknown Speaker 35:01 So there is information, and Edgar can correct me if I'm wrong, but there is information on the open data portal, and I will let somebody add it to either the Zoom chat or Mattermost, that walks you through the process of how to generate your own data. So you might start with the LHE files from MadGraph, or maybe you have some other tool that you're using. And then there are very specific processes you have to walk through, very specific commands, to, you know, use some hadronization tool, run it through the full detector simulation, and then reconstruct it. But yes, the instructions are there so that you could do your own thing. It may take a while, and there have been discussions about maybe automating this process. But yes, the instructions are there, and I will ask, if anybody is able to add them at some point, maybe we can share that. Good. Thank you. You're welcome. Unknown Speaker 36:00 Okay. Unknown Speaker 36:02 All right. So, there are no other immediate questions, and we've got about 10 minutes, which I think is the right amount of time. Let's see, I'm going to go back to... so I'm now back in the lesson, and I'm going to go to the last module here, which is: what is in the data files? So, you know, you've started messing around, or you have some students and you want to tell them: hey, just what is in these data? Are there muons? Are there taus? Is there detector hit information?
Is there the raw four-vector information? How do I find it? So you may want to just take one of these ROOT files and very quickly inspect it. There are different tools out there for this; we have a whole suite of tools that are prefixed by EDM, for the event data model, and they're designed for very quick and dirty examining and inspecting of the files, looking at the size of the files, and so on. You'll be introduced to even more of these in some of the later lessons, but this is a first-order look at them. The most basic one, to my mind, is edmDumpEventContent. If you give it one of these very large paths, or if you have a ROOT file that's local, if you give it one of these ROOT files, you can dump information. I'm going to make my text a little bit bigger for those of you that may not have it on your screen. But if you use the help option, you see information, like how to pick out branch names. Now, branch is the terminology used to specify some common set of information, let's say about the muons' transverse momentum, or some energy deposition in one of the detector elements. And so I have a command here that you can run either in the virtual machine or in Docker; notice that the command is at the beginning, and then most of this text is one of those really long file names. Rather than having you go and pick one of those and then run it, I gave you one sample here. It looks like the one that I chose is a ttbar plus jets, semileptonic decays, done through MadGraph at 8 TeV, and then a bunch of information that gets us to just one file. Now, the amount of information that it dumps to the screen is large, and you're welcome to dump it to the screen, but I've given you a sample command that writes it out to a log file. And then you can use whatever command you prefer, less, more, cat, vi, nano, Emacs, whatever, to look at the file. So let me go ahead and do that. Now, I have a Docker container open.
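Once that dump is sitting in a log file, filtering it for a particular collection is often all you need. A sketch that filters dump-style lines in Python; the three sample lines are illustrative stand-ins for real edmDumpEventContent output, which prints roughly Type / Module / Label / Process columns.

```python
# Illustrative stand-ins for lines from an edmDumpEventContent log file;
# a real dump has many more, in Type / Module / Label / Process form.
dump_lines = [
    'vector<reco::Muon>         "muons"            ""   "RECO"',
    'vector<reco::GsfElectron>  "gsfElectrons"     ""   "RECO"',
    'vector<reco::PFTau>        "hpsPFTauProducer" ""   "RECO"',
]

# Quick-and-dirty filter: keep branches whose stored type mentions Muon.
muon_branches = [line for line in dump_lines if "Muon" in line.split()[0]]
print(muon_branches)
```

The same one-liner idea is what a `grep Muon dump.log` does at the shell; either way, the point is to answer "are there muons in this file?" without opening the ROOT file interactively.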
You can either do this in Docker or in the virtual machine if you prefer. I just have to find out which one of my Unknown Speaker 39:28 windows I can share on Zoom. Unknown Speaker 39:38 Sorry. Yeah. Unknown Speaker 39:52 Alright, I'm just going to share my desktop and we'll see if this... Transcribed by https://otter.ai