Unknown Speaker 0:04
Okay, so we did so much in this first hour and a half. Where we've come so far is basically that we have taken this EDAnalyzer, the AOD-to-NanoAOD tool, and over yesterday afternoon and so far today you've added some things to it. We've used it as an example of how to actually take the CMS files, load what we call collections as C++ objects, and use them to extract the physics. Before we move on, I wanted to show you that I did the very bottom challenge at the bottom of the page, which is to draw some uncertainties for the jet pT. Let me zoom in a little bit. I've drawn the uncorrected pT, which is this solid line that shoots up here from 15 GeV, because we put a selection on it, and then the corrected pT and the uncertainties are closer together here

Unknown Speaker 1:28
in different colors.

Unknown Speaker 1:32
So that's just one example of looking at this output file and using some of the ROOT TTree commands that you learned in the pre-exercise to investigate some of the differences here.

Unknown Speaker 1:45
Okay, it doesn't want to zoom in, it's stuck. So, um, oh, there we go.

Unknown Speaker 1:54
There we go. So we've got a fairly sharp difference between uncorrected and corrected, because obviously we put a selection on one and not on the other. And then you can watch the uncertainties in different colors wrapping themselves around the dotted lines. That would be one of the methods we use for a systematic uncertainty: shifted histograms like this. So I just wanted to show that as a wrap-up of the previous lesson. Are there any big problems that we can take a moment to solve,

Unknown Speaker 2:41
in terms of updating the code or having access to what you need?

Unknown Speaker 2:55
All right. Feel free to ask on Mattermost, of course, as we go through this, if one of the code instructions is giving you trouble. The next two lessons are going to look very different. We're going to do two lessons that run one straight into the other, which are focused on saying: now that we have this analyzer to produce an output file with physics information, how do we actually do that, and then what is the next step to ask a physics question and select events that will allow us to answer it. So we're going to talk first about pre-selection and skimming. We're going to start to get into words that mean something different to everyone, and that's fine. The goal here is basically to summarize everything that we've done up through producing these NanoAOD files. So I'm going to start with the first lesson here, excuse me, episode here. There are no exercises; this is just an introduction for you on how CMS processes its files. The first thing we do, which you talked about yesterday, is apply triggers. This is just the reality of life: as data comes off the detector, we must apply triggers to reduce the rate of data coming out of these collisions, because we cannot collect everything. So we define a menu full of trigger paths, and events that pass one of those trigger paths will end up in what we call a primary dataset. That's the word in bold there. A primary dataset is a collection of data events that have passed one or more of a certain set of triggers.
And the example for this workshop is that we're going to look for a Higgs boson decaying to tau leptons. So we would choose the primary dataset TauPlusX; we would say, let there be taus, and we would collect all the events that have taus. The goal of this is not just to slim down the data flow, although obviously we need triggers for that; it also makes things simpler for analysts, because often we are not interested in looking at literally all of the CMS data, we're only interested in looking at data that meets our physics questions. Okay. So we've applied triggers, and we have primary datasets. As I just said, it would be impossible to do your physics analysis with every event from every primary dataset; you would be overwhelmed with CPU requirements. And the same would be true if we said, well, here's your raw data, go for it. If everyone who wanted to do anything had to apply all of the CMS reconstruction and everything else that we do, it would be an overwhelming burden. And so we skim this data. Skimming can mean a couple of different things, and maybe you'll use other words, maybe you'd say, no, that's not skimming, that's slimming; fine. We could actually get rid of events: we could produce a skim for some purpose that contains events that look alike in some way. Alternately, we could skim by getting rid of information, so we simply make the files smaller while keeping the same events. We pursue both of these paths for different purposes. The biggest gain in terms of getting the file size down, and getting the CPU requirements down for users, comes from removing extra information. So at each step of the process, as we do something with a file and perform calculations, we can then drop some of the inputs to those calculations, if we anticipate that users will not need to make the calculation over again, which is very often true. So, the procedure: we start with RAW files, and we reconstruct them. The first big change is from what we would call a RAW file; I'm going to go down to this little diagram. Yeah.

Unknown Speaker 7:16
For example, if I want to skim AOD to NanoAOD, what is the procedure? Should I first go AOD to MiniAOD, and then MiniAOD to NanoAOD, or can I just do a straightforward AOD to NanoAOD?

Unknown Speaker 7:34
So, you have in your hands an example of starting from AOD and just producing a NanoAOD file from it. That's literally what we are doing in this AOD-to-NanoAOD example, so it is not required that you first produce MiniAOD. In terms of what happens when CMS is doing this centrally, for many users, that is usually the procedure: take AOD, produce MiniAOD, then produce NanoAOD. And that's because a script has been written that is supposed to be universal across many analysis needs, which will produce all of these objects and, from them, the NanoAOD tree. What we've shown you is a way to make something that's functionally similar to centrally produced CMS NanoAOD, but made straight from AOD based on what you want and what you need. We'll circle back to that; we'll look at it literally in the next episode, where we'll look at everything we've done to go from AOD to NanoAOD. Does that answer your question? Okay, so, as I was saying, the first thing is to reconstruct.
So, this is common sense: we take the raw detector information and we pass it through our reconstruction algorithms, tracking, vertexing, running Particle Flow; all of these are done to make a RECO file. We've got this graphic that shows, for a couple of different types of objects, tracks or electrons or jets, what information would be in the RAW file, in red, compared to the RECO file, in yellow. And the next level is to reduce this to objects. That's what we typically think of as appearing in AOD: when we go from RECO to AOD, we reduce some of the information in RECO to just objects, and that's what AOD stands for, Analysis Object Data. So you have reco::Electron, reco::Jet, you have these objects, and not all of the behind-the-scenes information is kept. AOD files are small enough for people to use, and this is what was done in Run 1; small enough for people to use because the physics has been prioritized over the detector information, with the assumption that all the procedures that went into reconstructing and creating these objects don't need to be repeated by you, the user. That is typically going to be true for people using Open Data: you're probably not interested in creating new reconstruction algorithms for CMS. Okay, so we have these AOD files, and that, of course, is what we are working with in this exercise. In actual CMS, especially in Run 2, I should say only in Run 2, the concept of MiniAOD was begun; AOD just became too large. MiniAODs are smaller, the file size is smaller, because we've already gone through the procedure of making the PAT jets, so the corrections are already applied, the objects have already been selected, and a bunch of ID algorithms have already been computed, and so what's behind all of those can be dropped out; those files are even smaller. And then in CMS we typically make a NanoAOD file from MiniAOD, which has a completely different format, this flat ROOT tree format that just has branches, like we've been talking about. So this is a demonstration of how it would be done for actual CMS analyses, but NanoAOD files have this generic structure that we can produce in a lot of different ways. Okay. So while this is going on, the skimming and the reduction of the size of the dataset, things are also moving around in the world. We start obviously at CERN: all of the raw data goes on tape at CERN, and it's reconstructed, usually in the Tier 0. Then we make copies of the raw dataset, and also of the reconstructed datasets, that are stored at what we call Tier 1 sites. These are other computing clusters at a variety of places around the world. At Fermilab we have a Tier 1, probably the main Tier 1 serving the United States, and there's a second copy of things stored there.

Unknown Speaker 12:33
Sorry.

Unknown Speaker 12:34
Yes.

Unknown Speaker 12:36
Maybe, just so I get the full picture here: you get the raw data from the detector, which is a bunch of electronic signals, and then from these raw signals you reconstruct the physical objects.

Unknown Speaker 13:00
That's what you call RECO.

Unknown Speaker 13:03
Yes. So that would be things like taking tracker hits and forming them into tracks,
reconstructing vertices from those tracks, and then taking all of the calorimeter hits, in addition to the tracks, doing the clustering to make clusters out of hits in the calorimeter, and building Particle Flow candidates; that would all be part of reconstruction.

Unknown Speaker 13:28
And the same can be done to Monte Carlo generated samples?

Unknown Speaker 13:33
That's true, there's a similar sequence for Monte Carlo; I just stuck to the data here. For Monte Carlo we start with the generated particles, and then we do what's called digitization, which is where we get the simulated detector response to those Monte Carlo particles. After that the sequence looks quite similar: taking those simulated, digitized detector hits and then doing things with them to get reconstructed values from the Monte Carlo.

Unknown Speaker 14:08
So there's kind of an extra layer, I guess, at the beginning.

Unknown Speaker 14:10
All right, thank you. Good question.

Unknown Speaker 14:14
There is, I think... where did I put the link?

Unknown Speaker 14:19
Somewhere on here there's a link, it might be just at the bottom. Here it is, in the public workbook. So there's actually a workbook page in the CMS public realm, so you can look at it, and I believe it has the similar analogy for Monte Carlo. Okay. So as we're doing all of this, we're moving things around. By the time you get down to an AOD or a MiniAOD or a NanoAOD sample, you, the CMS analyst, probably have access to some files that maybe you wrote, or maybe CMS wrote, but they're stored on Tier 2s in your country, so it's quicker to access them because they're physically close to you. And the size is generally small enough that it's workable, let's say, to stream them via XRootD, the way you've been doing with the NanoAODs; the goal is that you might even be able to just put them on your laptop and do analysis on your laptop. That depends on how nice of a laptop you have. So that's what's happening to the data, and we're going to take part in the last bit of that: we're going to look at what we're going to do to produce these NanoAOD files. And of course, when I say we're going to produce them, I mean you can produce them, but they have already been produced, so please don't stress out if you're not able to produce new files right now; that will be fine. Okay. So, where we get into trouble: I've seen a lot of analysis presentations where someone asks, "what selections did you make?", and the poor student giving the presentation will say, "I didn't select anything on this object," because all of the selections were hidden one level back, in making their input file. Since we're doing things at different levels, it can be helpful to summarize what selections we have built into our input files before we actually sit down to say, okay, Higgs to tau tau, what do I want? So we're going to walk through this quickly and look at what we have done. One thing that we have skimmed a lot to make these NanoAOD files is the triggers. I think you looked at this yesterday: we have this list in our code of three interesting triggers, where of course you can see right away that "interesting" is based only on which physics analysis we are setting out to do right now. There are hundreds of triggers, and all of them could be interesting for different people at different times.
And so when we go through and load, we open up the trigger results collection from the input tag, and we start to loop over them. One thing we're doing is asking: do we find one of the interesting triggers in this name? If we do, then we check whether the event passed or failed; and if we don't, we skip it. So we have gotten the file size down a lot; well, we've gotten it down by completely disregarding 99% of the triggers and keeping the information for only three. If I were going to take this workshop example and do a completely different physics analysis, that is one of the places where I would start: go back to the trigger lesson, think about which triggers I want, and come edit this part of the code. So in this sense, this NanoAOD is not universal. It is for one analysis, and one analysis only, as opposed to being applicable to many. That's one type of pre-selection, one type of skimming we have done, to get rid of all those extra triggers. The other thing hiding in our NanoAOD producer is momentum thresholds: in every block of code here we have a momentum threshold. If we go and look at the jets again, the AK5 PF jets, here we set, and this is hard-coded, a minimum momentum of 15 GeV. That's the very first question we ask: is the momentum high enough to pass my threshold? If it's not, we completely ignore the jet. We talked earlier about also putting some of the noise ID in here, so that would be another set of pre-selection requirements on the jets, to say they have to pass the noise ID. So we have momentum thresholds built into every object; I just wanted to highlight this for you.

Unknown Speaker 19:18
They're kind of low momentum thresholds, actually. So compared to starting at zero, they're high: they're going to get rid of that really low momentum range where perhaps we don't do well at reconstructing this object, or perhaps the misidentification rate is just really large. But they are still pretty low in terms of physics analyses. One example: even the reference page for the electrons and the photons recommends that their identification criteria be used only down to 15 GeV. We've saved everything between 5 and 15 GeV, but that might actually not be suitable for analysis. Similarly for jets: we looked at the uncertainty distribution, and it was falling with momentum, there was a larger uncertainty at low momentum, and once you get to about 20 or 30 GeV we're in the stable region. So most analysts are recommended to use jets of 20 or 30 GeV, as opposed to going all the way down to 15. So there are pre-selections that we build in, but this may not be the end of the story: we might later select jets with a higher momentum, or electrons with a higher momentum, or something like that.
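The actual pre-selection lives in the C++ EDAnalyzer, so the snippet below is only a minimal Python sketch of the same logic. Only the ideas come from the lesson, a short whitelist of three triggers and a hard-coded 15 GeV jet threshold; the trigger names, the event containers, and the function names are illustrative stand-ins.

```python
# Illustrative sketch only: the real AOD-to-NanoAOD producer does this in C++.
# The trigger names and data structures below are placeholders.

interesting_triggers = [
    "HLT_IsoMu24_eta2p1",                      # hypothetical names; edit this list
    "HLT_IsoMu17_eta2p1_LooseIsoPFTau20",      # when you change your physics goal
    "HLT_IsoMu24",
]

def keep_trigger_bits(trigger_results):
    """Keep only pass/fail bits of triggers whose names match the whitelist."""
    kept = {}
    for name, passed in trigger_results.items():
        if any(name.startswith(interesting) for interesting in interesting_triggers):
            kept[name] = passed   # store whether the event passed or failed
        # everything else is simply never written out, which is the big size saving
    return kept

def keep_jets(jets, min_pt=15.0):
    """Apply the hard-coded momentum threshold before storing any jet."""
    return [jet for jet in jets if jet["pt"] > min_pt]
```

The key point, exactly as in the C++ version, is that anything failing these checks never reaches the output file at all.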
The last thing I wanted to mention is the generated particles. We have many generated particles in Monte Carlo, obviously; you know that even better than I do. And again, just like the triggers, we've decided a priori, hard-coded, which ones are interesting, and those are leptons and photons. So leptons and photons are the only things that have been declared interesting here, and actually the only ones that get saved in these branches, the generator-level pT and eta values, are the ones that match to a reconstructed object. So we've completely selected away everything that is not a lepton or a photon, and everything that doesn't match one of our reconstructed objects. That's hidden, baked into this code, and it is completely customizable by you, the user; but it's important to recognize what's been put in. So the key point here is: you have the power to decide what is interesting, but every time you change your mind in a significant way about what's interesting, you might have to reproduce your NanoAOD files. In the generic CMS case, they try to accommodate as many people as possible, maybe about 50% of people wanting to do analyses. But our example here is a little more customized.

Unknown Speaker 22:04
So, any questions about any of this?

Unknown Speaker 22:08
If you compare the size that you produce to the official, quote unquote, CMS size of NanoAOD, what is the comparison?

Unknown Speaker 22:18
Ours should be smaller. I haven't actually made that comparison numerically, but certainly our files will be smaller; we have many fewer branches in the tree. We might be as low as 50% of the size, I would have to look. The central files save most of the triggers, and most of the gen particles, and things like that, so that builds up the size. But the NanoAOD size is still really small, so we're doing even better.

Unknown Speaker 22:52
Okay, thanks.

Unknown Speaker 23:00
Okay, so, how to run your own. This is going to be the baby version, the quick intro version of how to run your own, because tomorrow morning you're going to have an actual tutorial on how to do this in the cloud, which will be a really nice method. But I wanted to show you a couple of ways in case, for whatever reason, that's not a great solution for you, or you just want to do some testing without going into too big of a setup.

Unknown Speaker 23:32
So,

Unknown Speaker 23:34
let's grab this configuration file. In our configuration file,

Unknown Speaker 23:45
for simulation, and also for data,

Unknown Speaker 23:49
we have some...

Unknown Speaker 23:53
I'll just show it here. We have been using a setup that just opens one file. We have told it the source is going to be what's called a PoolSource, that's the class name, and that has a parameter it can take called fileNames, and we have given it one file name. However, this is set up as a vector of strings; this vstring means it can happily take more than one value, we've just given it only one. So the first basic way to run over many files is to put many files in your configuration and let it run over all of them. We have this example; you'll see it commented out in your config file. It says this is a file setup for lots of files, not a short test, and that's the little snippet that's copied here. We can also load a list from a file. In the data directory of this repository, I think I actually have it handy on my Docker container here, you have access to a lot of text files that are lists of file names. So if I open one of these files, CMS Run2012C TauPlusX, whatever, I just have to pick one of these text files, here's a whole bunch of EOS public links.
So someone has gone through what I think you learned the other day, how to find the CMS files, and has said: these are the samples I'm interested in, and I'll dump all of the ROOT files into a text file. The configuration can be taught how to open this text file, and it can even be extended to open another text file; this example is opening two different text files, two different file indexes, so it will be a doubly long list, and it will just run over all of them in a chain. As you can imagine, this makes your job take longer and your output file bigger, because it's 200 times the length and the size if you're running 200 files instead of one, or however many you tell it to do. Depending on the scope of your job, this could be a really good solution, or maybe not; it totally depends on what you're doing. If you're doing something really fast, sure, just throw all the files at it. Great.

Unknown Speaker 26:53
Quick question. So you,

Unknown Speaker 26:58
at some point, you have to define the datasets, the real data datasets, and also the Monte Carlo datasets that you want to use. Do you follow the same procedure for both of them?

Unknown Speaker 27:11
Yes, that's absolutely correct. So this is an example of saying: I am running the ttbar sample, and it has files in two text files.

Unknown Speaker 27:25
So for each sample I would need to repeat this?

Unknown Speaker 27:29
Okay. So one reason we might want to parallelize is that we're going to have many samples: to run all of my ttbar, and then run something else, and then run something else... I'm not super patient. So I actually would like to see if we can do this experiment; I haven't done it myself, but you can get an idea, and I think it will be different for everyone. In your configuration file, change where it says to run only 200 events and set that to negative one, and give it a different output name so it doesn't just overwrite the little test file that you've already made. If you do that, and save and quit and set it running, I would be very curious to see what everyone's time is. Is it 10 minutes? Is it an hour? Everyone might have a different answer: how long does it take you to run one whole file? This is not using the list of text files, just the one test file; I can pull it up, sorry, I have to keep moving Zoom around my screen. Just this one test file, Drell-Yan or ttbar or whatever it is; it would be very interesting to see how long it takes to run this one

Unknown Speaker 29:03
thing.

Unknown Speaker 29:06
So there's my negative one, and I will name the output something like output.root, why not. I'm not loading the database, so it shouldn't be slowed down extra by that. So I'm just really curious

Unknown Speaker 29:24
to see what happens.

Unknown Speaker 29:30
So I'm going to set this running and pipe it to a log file, so that I can do other things in the meantime, but we'll get an idea from this

Unknown Speaker 29:40
test.

Unknown Speaker 29:46
Okay, that'll be interesting.

Unknown Speaker 29:49
So that will give you an idea of what it means to run one file on your setup right now, so you can evaluate how you might want to run many files.
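For reference, here is a minimal sketch, in the style of a CMSSW Python configuration, of the pieces just described: a PoolSource whose fileNames vstring lists several files, a file-index list loaded with FileUtils and extended with a second one, and the switch from a 200-event test to a full run with a new output name. The specific EOS paths, the index-file names, and the TFileService output module are placeholders, not the workshop's actual values.

```python
import FWCore.ParameterSet.Config as cms
import FWCore.Utilities.FileUtils as FileUtils

process = cms.Process("AOD2NanoAOD")

# Option 1: list several files explicitly. fileNames is a vstring,
# so it happily accepts more than one entry.
process.source = cms.Source(
    "PoolSource",
    fileNames=cms.untracked.vstring(
        "root://eospublic.cern.ch//eos/opendata/cms/placeholder_file_1.root",  # placeholder paths
        "root://eospublic.cern.ch//eos/opendata/cms/placeholder_file_2.root",
    ),
)

# Option 2: load a whole file-index list from the data directory,
# and extend it with a second index so the chain is doubly long.
files = FileUtils.loadListFromFile("data/CMS_Run2012C_TauPlusX_file_index_1.txt")  # placeholder names
files.extend(FileUtils.loadListFromFile("data/CMS_Run2012C_TauPlusX_file_index_2.txt"))
process.source.fileNames = cms.untracked.vstring(*files)

# Run over everything instead of a 200-event test...
process.maxEvents = cms.untracked.PSet(input=cms.untracked.int32(-1))

# ...and write to a new name so the earlier test output is not overwritten
# (TFileService here stands in for whatever output service the workshop config uses).
process.TFileService = cms.Service(
    "TFileService", fileName=cms.string("output_full.root")
)
```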
There were a couple of ways I could think of, off the top of my head, that you could parallelize. You could execute multiple cmsRun processes: set up a cmsRun job for each one of your samples using the file-list technique, and run them all at the same time. You could create a script to loop through all of your files, swap in the input and output names, and execute cmsRun for every individual ROOT file coming in, which would mean you get a lot of output files. You can use the cloud, as we're going to learn tomorrow morning. And then, given your own computer setup, you could probably come up with something completely better. If you have Condor available, the repository does have a job submitter for Condor; I have never explored trying to use Docker containers inside Condor, so others may know good solutions for that, and that might be a discussion point for later. But I wanted to give you two little options. If you use a method of parallelization where you get a lot of output ROOT files, ROOT has some methods for you to combine them. One method is just the hadd command, where you tell ROOT: I would like to add together, into merged_file.root, all of these inputs, and it adds things up. If you have a tree, your tree becomes longer; if you have histograms, the number of entries goes up. There is also, which I will show you,

Unknown Speaker 31:44
oh, of course, I have a fatal exception.

Unknown Speaker 31:50
I have not set my environment correctly.

Unknown Speaker 31:54
It's good for you to see me struggle.

Unknown Speaker 32:02
Okay, so I'll show you in here:

Unknown Speaker 32:07
there is a file called merge jobs,

Unknown Speaker 32:12
and what merge jobs wants to be given is an input directory, a directory in which you have the output of many cmsRun jobs. It will do some cool things: it will figure out how many files you expected, it will check and see if any of them are missing, this is based on the idea that the outputs would be numbered, and it will figure out if any of them are corrupt. Anyway, it finds out if they're missing. And then, this is what I really wanted to show you, it uses ROOT's TChain method to open a tree. So this is our Events tree, and then you can chain up all of your files: as you loop over the files, you can add them to the chain, as long as they have the same tree name inside. So if everything has aod2nanoaod/Events, you can add them up, and then you can use the Merge command to write an output file. So there are a lot of options, which will inevitably end up being a little user-specific, to get your cmsRun processing running in parallel and then combine your output files.
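One way to realize the "loop over files and call cmsRun once per file" idea is sketched below. It assumes the configuration has been adapted to read inputFiles and outputFile from the command line via CMSSW's VarParsing (the workshop configuration hard-codes them, so that change is up to you); the index-file name and the output naming scheme are placeholders.

```python
#!/usr/bin/env python
# Sketch: run cmsRun once per input file, producing one output file each.
# Assumes configuration.py parses "inputFiles=..." and "outputFile=..."
# via FWCore.ParameterSet.VarParsing; names below are placeholders.
import subprocess

# e.g. the lines read from one of the file-index text files
with open("data/CMS_Run2012C_TauPlusX_file_index_1.txt") as index:
    input_files = [line.strip() for line in index if line.strip()]

for i, input_file in enumerate(input_files):
    output_file = "output_%03d.root" % i
    log_file = "job_%03d.log" % i
    cmd = [
        "cmsRun", "configuration.py",
        "inputFiles=%s" % input_file,
        "outputFile=%s" % output_file,
    ]
    # pipe stdout/stderr to a log so many jobs can run without flooding the terminal
    with open(log_file, "w") as log:
        subprocess.call(cmd, stdout=log, stderr=subprocess.STDOUT)
```

As written, the jobs run one after another; to actually run them at the same time you could launch them with subprocess.Popen, hand them to a batch system like Condor, or use the cloud setup shown tomorrow.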
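And here is a sketch of the two merging routes just mentioned: hadd on the command line, or chaining the trees yourself the way the merge-jobs script does. The output-file pattern is a placeholder, and the tree path shown is the aod2nanoaod/Events name mentioned above; adjust both for your own setup.

```python
# Option 1 (command line): let hadd concatenate everything.
#   hadd merged_file.root output_*.root
#
# Option 2 (PyROOT sketch): chain the trees yourself and write one file,
# similar to what the merge-jobs script does. The tree path is the one
# discussed above; change it if your producer writes somewhere else.
import glob
import ROOT

chain = ROOT.TChain("aod2nanoaod/Events")
for filename in sorted(glob.glob("output_*.root")):
    chain.Add(filename)          # every file must contain the same tree name

print("Merging %d files with %d total events" % (chain.GetNtrees(), chain.GetEntries()))
chain.Merge("merged_file.root")  # writes a single file containing the combined tree
```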
So we will also learn more tomorrow morning about how to do this in the cloud. Does anybody have any questions about running your AOD-to-NanoAOD script through cmsRun?

Unknown Speaker 34:18
Okay.

Unknown Speaker 34:21
Sounds like no. So, we are going to move on then to, let me fix Zoom,

Unknown Speaker 34:37
we are going to move on then to the next lesson, a little bit early, which is good. The next lesson is called Object ID and Event Selection, where we will go through the Higgs to tau tau analysis. So now we've paused and done an overview of what it means to have a NanoAOD file; hopefully everybody has gotten at least one little output ROOT file that works for you. And so now we'll go on to look at Higgs to tau tau. Alright. The very first thing we want to talk about is how to set up this example. We'll check out a little bit of code, and then I have a short refresher of RDataFrame, which I know you did in a pre-exercise, and that pre-exercise is much more comprehensive. So take a moment and clone this, hopefully correct, Higgs to tau tau branch; I'll do it at the same time just to make sure. Okay, I'm going to get out of the area that I made; I need to get out of that area, maybe I'll go back inside workspace. This is not an EDAnalyzer that we're checking out now, so it does not have the same path restrictions as the other one; it should be pretty safe for you to clone it wherever you like.

Unknown Speaker 37:11
Okay.

Unknown Speaker 37:21
Please feel free to shout if you have any problems, because we're going to really launch into looking at this file, so if there's any trouble cloning, we should try to address it. All right.

Unknown Speaker 37:46
Open up skim.cxx; that's where we're going to start.

Unknown Speaker 38:02
Okay. And the very first thing that happens here, after importing and including things, is to set up the names of the datasets. So let's look at that. First, we have to decide on data and simulation files for our analysis. You learned how to do some scouting of which data exists, and so now we're ready to think about it in terms of physics. We want to search for Higgs bosons decaying to tau leptons. There are several options for how taus decay: taus often produce muons, taus often produce hadrons, so there are some options there. We have in CMS, of course, many muon triggers, and our AOD-to-NanoAOD example used a muon trigger. So we could take that tactic and say, I'll analyze the single-muon dataset because I'm going to choose events where one tau decays to a muon, or both taus decay to muons. You could do that. We have chosen instead to analyze the primary datasets called TauPlusX, and what this typically means is that events have passed at least one hadronic tau trigger, as opposed to a trigger on the leptonic decay of the tau. So we're pre-selecting ourselves into a space where we have hadronic taus in the final state. Then we will go and look for some signals. We're going to focus on the tau tau decay mode, but we have two production modes available: a Monte Carlo file for gluon fusion production, and a Monte Carlo file for vector boson fusion production. Both of those will serve as signal, and you can imagine they might have different properties in the events. Okay. Then we have to think about the background processes. Some backgrounds cannot be estimated well from simulation; this would be the case if your backgrounds are very sensitive to things like misreconstruction, because it's hard to simulate misreconstructed things. But if your backgrounds are based on physics processes, then we typically can analyze Monte Carlo, at least to get an idea of what those processes look like. In our case, we're going to say that one tau has decayed to hadrons and one tau has decayed to leptons. So right away, anything that produces leptons, especially with some hadrons coming along, is a concern: if we have one hadronic tau and one leptonic tau, we're going to have some jets, and we're going to have a muon, let's say, in the event. So anything that produces leptons along with jets will automatically be a background process, maybe smaller, maybe larger, but it will be a background process. Here we've pulled out Drell-Yan, ttbar, and a variety of W plus jets samples as our standard benchmark lepton producers, and all of these will also have jets produced alongside.
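Since the next lesson builds on RDataFrame, here is a rough PyROOT sketch of the skimming pattern that skim.cxx implements in C++: name the data and simulation files, apply an event filter, and write out a reduced tree. The file paths, sample names, branch names, and cuts are illustrative placeholders rather than the workshop's actual values.

```python
# PyROOT sketch of an RDataFrame-style skim; paths, branch names and cuts
# are placeholders, not the workshop's actual skim.cxx values.
import ROOT

samples = {
    # signal: two Higgs production modes
    "GluGluToHToTauTau": "root://eospublic.cern.ch//eos/opendata/cms/placeholder/GluGluToHToTauTau.root",
    "VBF_HToTauTau":     "root://eospublic.cern.ch//eos/opendata/cms/placeholder/VBF_HToTauTau.root",
    # backgrounds: lepton producers with jets alongside
    "DYJetsToLL":        "root://eospublic.cern.ch//eos/opendata/cms/placeholder/DYJetsToLL.root",
    "TTbar":             "root://eospublic.cern.ch//eos/opendata/cms/placeholder/TTbar.root",
    "W1JetsToLNu":       "root://eospublic.cern.ch//eos/opendata/cms/placeholder/W1JetsToLNu.root",
    # data from the TauPlusX primary dataset
    "Run2012B_TauPlusX": "root://eospublic.cern.ch//eos/opendata/cms/placeholder/Run2012B_TauPlusX.root",
}

for name, path in samples.items():
    df = ROOT.RDataFrame("Events", path)
    # example pre-selection: at least one muon and one hadronic tau candidate
    skim = df.Filter("nMuon > 0", "At least one muon") \
             .Filter("nTau > 0", "At least one hadronic tau")
    # write only the events (and, in the real skim, only the columns) you need
    skim.Snapshot("Events", "skim_%s.root" % name)
```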