Unknown Speaker  2:50  
So our next

Unknown Speaker  2:52
instructor is Santeri. He's from CERN; he's a research fellow at CERN. And he's prepared a lesson about something which is very important, and which is sometimes one of the most difficult parts of an analysis: determining or estimating the background. So he's going to teach you about the ABCD method, applied to estimating the QCD background. So all right, Santeri, it's all yours. Maybe I'll take a few seconds to end the poll.

Unknown Speaker  3:32
Meanwhile, I'll start sharing my screen.

Unknown Speaker  3:39
Okay, I hope you can see.

Unknown Speaker  3:49
Yeah, we see it great. So if you want to see the results

Unknown Speaker  3:56
Yes, let's have a look at the results, just to know where people are.

Unknown Speaker  4:03
Okay, good. So it looks like

Unknown Speaker  4:12
we have certainly made some progress. Now we have essentially one third all the way done, up to the final steps, one third skimming the files, and then one third getting things started. So that looks good. That looks very good to me. Thanks Erica for making this poll so that we know what's going on.

Unknown Speaker  4:40
Okay, then

Unknown Speaker  4:45
moving to the topic of background estimation. For this lesson, we are going to follow the tutorial that is linked in the workshop page in the agenda, so you can just open it and find all the instructions there. But I will also share it here, so that we can slowly go through it and discuss things. And, as always, feel free to interrupt me if you have any comments or questions. It's Friday evening here at CERN, with a long week behind us, so if I'm saying something incoherent, or something that doesn't make sense, just please comment; maybe it really didn't make sense, and then we can discuss it. But let's see how it goes. Good. So, yes, we will be discussing background estimation, and this will be based on the Higgs to tau tau analysis example that you've been working with yesterday and today already. Hmm. In practice, it doesn't really matter how you have it up and running, whether you are doing it locally, or in the cloud, or somewhere else; just do as you prefer. And also, if you don't have it running yet, no problem: you can just focus here on the conceptual side and see, as we go, what it looks like to run it. You have all the instructions here, and you can follow up later on and reproduce all the steps yourself. Mm, good. So, all these instructions I basically just wrote, so you will probably find some typos there; I will try to fix them as we encounter them. And all the code that we are using is written by Stefan. So later, if you run into any problems: if it's about the background estimation method itself, you can contact me directly; if it's about this particular code and some technicalities, then you can contact Stefan; or you can just email both of us if you feel like that. Good.
So, just to motivate what we are going to do: some of you might have the output plots yourselves by now, but at least if you go to the GitHub page of the Higgs to tau tau analysis, you will find some example plots of the output of the analysis. For example, here is one of the very final output products of this whole analysis, the mass of the di-tau system, with some sort of signal peak visible here. Now, if we look at this plot, as we have discussed already earlier, this stack of histograms here shows you the sum of all the different background processes. And then here in the legend, you can see which color corresponds to which process. All these backgrounds we have estimated from simulation, by processing the simulated samples, which you have done and which has been discussed previously, and then by normalizing them to the cross section and integrated luminosity, which was also discussed yesterday. And then we end up with some estimate for each background. This is true for all these different background components, except for this one here, the bottom one, which is the QCD multijet background. In fact, this background is not estimated from simulation in this analysis; we estimate it in a data-driven way. And this is exactly the ABCD method that I am going to talk about here. Okay, so then we move to the introduction. So what do I mean by data-driven background estimation? And what is the ABCD method?
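Editor's sketch: the normalization of a simulated sample to the cross section and integrated luminosity mentioned above is usually a single per-event weight. The snippet below is a minimal illustration under stated assumptions, not the analysis code itself. The function name `normalization_weight` and all numbers are hypothetical.

```python
def normalization_weight(cross_section_pb, luminosity_pb_inv, n_generated):
    """Per-event weight that scales a simulated sample to the data luminosity.

    Expected events = cross section x integrated luminosity; dividing by the
    number of generated events gives the weight of one simulated event.
    """
    if n_generated <= 0:
        raise ValueError("n_generated must be positive")
    return cross_section_pb * luminosity_pb_inv / n_generated


# Hypothetical inputs, chosen only to make the arithmetic easy to follow.
weight = normalization_weight(
    cross_section_pb=20.0,        # assumed cross section in pb
    luminosity_pb_inv=10000.0,    # assumed integrated luminosity in pb^-1
    n_generated=400000,           # assumed number of generated events
)

# If 5000 simulated events pass the selection, the normalized yield is:
expected_yield = weight * 5000
```

With weights of this kind, the raw number of simulated entries in a histogram and its normalized integral can differ by a large factor.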

Unknown Speaker  9:21
So, when we talk about data-driven background estimates, we mean something that is mostly based on data, but possibly not fully. Usually we also need some input from simulation. But still, the main features of our background prediction are coming from data. And the idea is that if this is actually based on data, then we trust this background estimate more than something that is purely simulation based. Because, of course, there is a risk that something in our simulation is wrong: we have some wrong model, or some wrong assumption, or some bug in the system. And if we get our background estimate wrong, then our conclusions about whether we are seeing, let's say, the Higgs signal that we are interested in or not can be completely wrong. So this is really critical for any analysis. Okay. Moreover, even if we decide to use the simulations in the end for some background predictions, usually we still want to somehow validate or confirm the estimates that we get from the simulations by using data. So for this also, data-driven methods can be useful. Okay, and this ABCD method is now one particular data-driven background estimation method, which is actually quite common. If you just select random CMS or ATLAS papers, quite often you will encounter some variant of this basic approach. And that's why we are going to discuss it here. So, the idea is that we define four different samples of data by four different sets of cuts. And one of these, this region D here, is our so-called signal region. We will get very soon to what exactly we mean by signal region; it's the region in phase space where we expect to see our signal, for example the Higgs decaying to two taus in this case.

Unknown Speaker  11:50
Sorry, yes. Go ahead. Just one second. Yeah. I'm just wondering: if we can rely more on, you know, background estimation done with data, why then do we want to do this background estimation with Monte Carlo? I mean, we could do any background estimation with data, I guess. So why do we bother doing it with Monte Carlo?

Unknown Speaker  12:24
Yeah, that's a good question. Um, how to put it? First of all, we usually need Monte Carlo, and we use it in many stages. I mean, already before doing the data analysis, we needed the Monte Carlo to design the whole experiment, we needed the Monte Carlo to get an idea of what kind of measurements we can do with the LHC in the first place, etc. So in general, there are several motivations for using Monte Carlo. Then, in the context of one particular data analysis: once we have the Monte Carlo, usually the fastest way to get forward is just to grab that Monte Carlo, which we do have for all major physics processes already, and use it to get the first background estimate. And then usually, after that, the question is: do we want to make sure that the Monte Carlo is giving us the right answer, do we want to refine this estimate to get smaller uncertainties, etc. And all of this we can get with a data-driven estimate. But on the other hand, if we were just looking purely at the data, without any Monte Carlo in our hand, it would be pretty difficult to say what is actually going on in the data. Because, you know, if you look at these plots, for example, the data we see is a sum of many very different physics processes. And to understand how we can get some handle on each of these processes, we are probably going to use Monte Carlo anyway. But yeah, ideally, I would say in almost any analysis, these two approaches support each other. So we can cross-check our background estimates, our mostly data-driven estimates and our Monte Carlo estimates, by doing both and seeing that things agree. Does this clarify it a little bit? Yeah, thank you. Okay, good. Yeah, that's definitely a good question.
And, yeah, also, whenever you start to do an analysis, it's always a valid question which backgrounds you want to take from data, which you want to take from simulation, and why. All right, yeah. So, the ABCD method. Yes. So, this is our signal region. And the idea is that usually there is some background process that we want to estimate; today it will be the QCD multijet background. Then we select some other sample of data, which we refer to as the control region, with slightly different cuts, and use this sample to extract our estimate for the shape of this background distribution. And then, in order to use this background shape in our actual signal region, where we want to estimate the background, we might have to correct it somehow, and this correction we call transfer factors. These are the event weights that we then use to correct the estimate that we get from this region C. So we have a couple of extra regions where we derive these transfer factors, and then we apply this correction from the transfer factors to modify the template that we get from region C, so that we then have the final prediction in region D. Here you can already see how these regions are defined in the context of the Higgs to tau tau analysis; we will come to it again a bit later. So, we are altering our isolation criterion for the muon, and then we are altering whether we require the muon and the hadronic tau to have the same charge or opposite charges, and this defines our four regions. Okay, but we will come back to this.
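Editor's sketch: the transfer-factor step described above can be written in a few lines. This is an illustration under assumptions, not the workshop code. It assumes that the transfer factor is the ratio of the yields in the two extra regions, B over A, and that B differs from A in the same way that D differs from C; the talk does not spell out this assignment. The function name `abcd_estimate` and all numbers are hypothetical.

```python
def abcd_estimate(shape_c, n_a, n_b):
    """Scale the background template from region C into region D.

    shape_c : list of bin contents of the background template in region C
    n_a, n_b: background yields in the two extra regions (assumed roles:
              B relates to A the way D relates to C)
    """
    if n_a <= 0:
        raise ValueError("yield in region A must be positive")
    transfer_factor = n_b / n_a
    return [transfer_factor * content for content in shape_c]


# Hypothetical template and yields, for illustration only.
shape_c = [10.0, 40.0, 30.0, 5.0]
prediction_d = abcd_estimate(shape_c, n_a=200.0, n_b=300.0)
```

A single overall factor is the simplest choice; a transfer factor can also be derived per bin or as a function of some variable, at the price of larger statistical uncertainties.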

Unknown Speaker  16:55
Um, this, I think, we already

Unknown Speaker  17:01
discussed.

Unknown Speaker  17:07
Yes indeed. So then I think we are ready to move on to this discussion about the control regions, and of course, with that comes also the discussion of what our signal region is. So now the question is what exactly we mean by signal region and control regions, and the answer is here. By signal region, we simply mean the exact region in phase space that is defined by our signal selections. So what are the signal selections? The signal selection is just the sum of all selections that we apply, starting from the trigger, then all the skims that we do to the data, and then all the final selections that we do using all the different physics objects, etc. And with the signal selection, our goal is usually to define a signal region where we have as good a signal-to-background ratio as possible. So we want to have a region where we have lots of signal and as little background as possible, so that we can get a strong result. Okay. Then, as we already discussed, in addition to this we can also use slightly different selections, and then we end up with different regions that we call control regions. The major difference between these two is that the control regions are selected in a way that they do not contain much of the signal that we are searching for. So, for example, now that we are looking for the Higgs to tau tau signal, we need to make sure, probably with a simulation, that if we use some control regions, they contain very little signal. Ideally there would be no signal in our control regions. If there is a little bit of signal, then at least there should be overwhelmingly more background than signal in that region, so that it's essentially signal free. Okay. In this ABCD method, as you already saw, we have three different control regions, which are A, B and C, and then we have the signal region, which is D.

Unknown Speaker  19:46
Hmm.

Unknown Speaker  19:48
All right. Sometimes you also hear people talking about sidebands. This is often just a synonym for control region. But especially in the case where you have, let's say, some mass distribution where your signal gives you a clear peak, and this is your signal window, then people talk about the signal window. And then the control regions are often the sidebands of this distribution, where you don't have the peak. So left or right of the peak, you have the sidebands. So that's the explanation for this name. But yeah, often these are used kind of interchangeably, and people can talk about sidebands even if they don't correspond to any particular side in any actual sense.

Unknown Speaker  20:39
All right.

Unknown Speaker  20:44
So, now we can proceed to see how we do this whole ABCD thing in practice. And we start by defining this control region C, which we use to extract the shape of our QCD multijet background in the context of the Higgs to tau tau analysis.

Unknown Speaker  21:10
So now,

Unknown Speaker  21:13
this is already implemented in the code. If you go and look at this histograms script that we use after the skimming script to produce a ROOT file called histograms.root, which contains all the input histograms that we then use for plotting later, in this script you can find the lines that I show here. We can also just go and look at this script directly, and if you have it open, you can have a look yourself. And here we even have some explanations of what is happening. So for this signal region D, we can see that the requirement is that the charge of the muon, which is q one, times the charge of our tau candidate, which is q two, is negative, meaning that they have opposite charges, right? And of course, if we have a neutral Higgs decaying to a hadronic tau and a muon, then certainly, if charge is conserved, we require these two to have opposite charges. So this is a very reasonable requirement for our signal region. Whereas for this control region C, we change this criterion, so that we actually require that this product is larger than zero, meaning that they both have the same charge. And now you can already kind of expect that with this criterion, we should not have very much signal entering our control region. Because basically what this requires is either that the charge of the muon or the tau is misidentified or misreconstructed, which is possible but very rare, or that we are picking up the wrong thing, a wrong tau or a wrong muon, or something that is not even a tau or a muon. Something goes wrong when we select this tau-muon pair, and then we could end up with something that looks like a muon or something that looks like a tau, and which in fact have the opposite charges, sorry, the same charge. Okay, but this should be quite rare.
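Editor's sketch: the sign of the charge product is all that separates the two selections described above. The analysis script applies this condition as a data frame filter; the plain-Python version below only illustrates the logic. The field names `q_1` and `q_2` are assumed spellings of the "q one" and "q two" mentioned in the talk, and the function name `region` is hypothetical.

```python
from collections import namedtuple

# Minimal stand-in for a selected muon-tau pair; only the charges matter here.
Pair = namedtuple("Pair", ["q_1", "q_2"])


def region(pair):
    """Classify a pair by the sign of the charge product."""
    product = pair.q_1 * pair.q_2
    if product < 0:
        return "signal"    # opposite charges: signal region
    if product > 0:
        return "control"   # same charges: control region
    return None            # a zero charge would not be a valid candidate


pairs = [Pair(-1, 1), Pair(1, 1), Pair(-1, -1), Pair(1, -1)]
signal_pairs = [p for p in pairs if region(p) == "signal"]
control_pairs = [p for p in pairs if region(p) == "control"]
```

Because the two conditions are mutually exclusive, no event can enter both regions, which keeps the control region statistically independent of the signal region.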
So we are kind of confident that this is a good selection for a control region. And then we go and do this. And then we fill these histograms

Unknown Speaker  23:55
with

Unknown Speaker  23:58
with both selections, and then we can look at the resulting histograms. So this is what is here in this challenge. Hmm. So now, if you are able to run the analysis up to the point where you produce these histograms, you run this histograms Python script, you end up with this histograms.root file, and then we can have a look at it with a TBrowser. So let's do that. I think, yeah, I have already run the script. So I can just open the TBrowser, and then I will share it in a moment and we can see what is going on here. One question. Sure.

Unknown Speaker  24:57
I mean, for example, in this part of the code I see that it says df.Filter, blah, blah, blah. What is this df stuff?

Unknown Speaker  25:07
Yes. That's a very good question. So this df refers to the ROOT data frame, RDataFrame, which is the ROOT object that we use here to process the data.

Unknown Speaker  25:19
Is it ROOT or RooFit?

Unknown Speaker  25:22
This is ROOT, actually.

Unknown Speaker  25:26
This is a rather new feature. I mean, relatively new. I think it's been there for a few years now, but it's a relatively new feature of ROOT, this data frame. If you just Google "ROOT data frame", you will find a lot of documentation about it already. So yeah, when we say data frame dot Filter, we say: take the whole sample of events that we have in this tree, and select only the ones that pass this condition.

Unknown Speaker  25:57
Thanks.

Unknown Speaker  25:58
No problem. Are there other questions before we start looking at the histograms?

Unknown Speaker  26:03
Yeah. So we will run this script, this histograms.py, in the container, right? The Docker container or the virtual machine? Yes, well, yes, indeed. When we go there, we should have a histograms.root file coming from the AOD to NanoAOD process. And we'd run over this histograms.root file

Unknown Speaker  26:39
Well, actually, yeah. So we can see the steps here. The first thing is to compile and run this skim script. In principle, this has been done already at some point. If it hasn't been done, running this skim can take some time, depending on your computer. I think it can take anything between 10 minutes and 10 hours depending on your machine. So yeah, it can take some time. But once you have run the skim, you have a bunch of output ROOT files, several of them. And then these are the inputs to this histograms Python script.

Unknown Speaker  27:29
All right. I think the skim script was one of the steps in Julie's

Unknown Speaker  27:39
lecture, too. Yeah, indeed,

Unknown Speaker  27:42
Well, okay. So that's that. I mean, from all of this conversion and AOD treatment we get a bunch of, well, a number of ROOT files. And that's what we'll use in this.

Unknown Speaker  27:59
Right, exactly. So just to make a connection: probably you didn't get to run this skimming, because it could take a while. But this is also what Adelina and Clemens were showing in the cloud, right? And so after you run it, you will have a bunch of these histograms as the output. Adelina showed these histograms; they had some issue there, but, you know, it's the same process. And then those will be the input to what Santeri is explaining here.

Unknown Speaker  28:37
It's in two ways of doing this.

Unknown Speaker  28:40
In bullet, thank you.

Unknown Speaker  28:43
So let me see if I can switch the share.

Unknown Speaker  28:50
For a moment, I think this is my terminal. Yes, I hope you can see it. And now it should be a bit larger. So when you run the skim, you get one ROOT file out for each simulated process. So you have this one, this one, then your data here, then the ttbar, VBF, W's. So you get a bunch of these ROOT files. And then these are the input to this histograms script, which gives you out this histograms.root. So yeah, if you don't have the skim done yet, it's absolutely no problem, because then you can just have a look at the output, which I will share with you right now.

Unknown Speaker  29:49
So here is the, the I

Unknown Speaker  29:54
yes, now I'm sharing the TBrowser. Good. So just looking at the output ROOT file, you see that we have a lot of histograms here. For each simulated process and for each data sample, we have basically all the histograms that we want to produce listed here. And because just a moment ago we defined the signal region and the control region in different ways, we now have two versions of each histogram. So for example, if I look at this visible mass of the, I think this is then the visible mass of the tau, hmm, this is now for our gluon fusion signal sample here. And you see something like this. Okay, so we have a peak. And we have here almost 5000 events; this is just the number of simulated events. No reweighting, no cross section, no integrated luminosity, nothing is considered here. This is just the raw number of simulated events. But I think for now it's good enough for us. Then if we go below, we will find all the same histograms with this extension CR, which means control region. And now if I look at the same histogram in the control region,

Unknown Speaker  31:37
here is what we get.

Unknown Speaker  31:40
Actually, I have to take back what I just said. I think these histograms do include the normalization of the signal: the histogram itself is normalized to the cross section and luminosity. Hmm. But this number of entries shows you just the number of simulated events. Okay, in any case. Here you see, first of all, that this normalized amount of signal is very small; if you integrate over the whole thing, probably you sum up to one event. And then also this raw number of simulated events is roughly 100, whereas we had 5000 in our signal region. So at least this quick check suggests that this control region doesn't have a significant signal contamination, so it's a good control region in this sense. Then, of course, if we want to check carefully, we should compare the sum of both signal processes to the sum of all backgrounds, at least the simulated backgrounds that we can get from here, to get some idea. Hmm. But this first check, at least, looks good. And I hope you get the general idea that now, for each process, we have the histograms here for the signal and control regions, and then we can play with them a bit later. Okay, any questions or comments before we move forward?
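Editor's sketch: the more careful check described above, comparing the summed signal to the summed background in the control region, amounts to a simple fraction. The function name `signal_contamination` and the yields are hypothetical; the normalized signal yield of about one event echoes the rough number quoted in the talk, and the background yield is made up.

```python
def signal_contamination(signal_yield, background_yield):
    """Fraction of the expected events in a region that come from signal."""
    total = signal_yield + background_yield
    if total <= 0:
        raise ValueError("total expected yield must be positive")
    return signal_yield / total


# Hypothetical normalized yields in the control region.
contamination = signal_contamination(signal_yield=1.0, background_yield=999.0)
```

What counts as small enough depends on the analysis; the point is only that the comparison should use normalized yields, not raw numbers of simulated entries.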

Unknown Speaker  33:22
I don't hear anything. So then let's jump back to the instructions that we have here. Here we are. Okay. So now we just went through this challenge of producing this ROOT file. Okay, I had it already there. And we checked that we have histograms there, and we compared what they look like in the signal and the control regions. Good. Then the next question is: how do we estimate the QCD background in this control region? Hmm. So, as is the case with our signal region, if we look at the plot that we have here, as we were discussing, the data in our signal region is a mixture of these different processes. And this is usually also the case for the control region. It's very hard to define a very pure control region where you have just one single process entering; usually you have many different processes. So now, if we want to estimate one particular process, here the QCD multijet, then we need to somehow extract the shape and the event yield of the QCD multijet events in this region C, and this is where we need the help of our simulated samples. Hmm.

Unknown Speaker  34:58
So what we do here

Unknown Speaker  35:01
Now we actually need to take a look at the next script, which is plot.py, which is the script that produces all these final plots for us to look at. So let's open that script and have a look at what is going on there. There is a lot of stuff in the script. First, we just define some labels for each of the plots that we want to produce; these are the texts that go on the x axis of the plot. We define some colors for the different processes. We have a helper function to pick up one histogram from this histograms.root file that we just produced. Then we have a lot of plotting settings here. Then here, we actually use this get-histogram helper function to pick up the histograms for each process, and the same for data. And then we enter the part, and I know this is going really fast; you can take an hour and have a look to understand each line of the script later, but now we will focus on this QCD estimation part. So now what we have here is the data-driven QCD estimation. What we do is that first we build our data histogram. Hmm. And now we are looking at this control region C that we use for background estimation, so we have this postfix CR as we pick up the histogram. Hmm. So we create a histogram that we call QCD, optimistically. We put there the data from Run B, and then we pick up the data from Run C, and we add these together into this QCD histogram. Hmm, but okay, of course this histogram that we get is not really QCD yet; it's just the sum of the data, and we know that the data contains all the possible physics processes that we can have in region C. Hmm. So now what we do is that we pick up the histograms for all the simulated backgrounds, which are listed here. So the Drell-Yan, sorry, the W plus jets, the ttbar, and then the Drell-Yan where the Z boson goes to two electrons or muons, or to two taus.
These are all our background processes, except for the QCD multijet that we are interested in here. And now, what we do is that we add these background processes to the QCD histogram, but with an extra factor of minus one. So, in fact, we are not summing them; we are subtracting each of these other backgrounds from our data. And the idea is that if we take the data and we subtract all the backgrounds that we can simulate, then we are left with the one thing that we didn't have in the simulations, which is exactly the QCD multijet background. And then finally, here, we do a check that, as we do this data minus other backgrounds thing, we don't accidentally get negative event yields, because it's not very physical to say that we have minus five events of QCD in this bin. So if we end up with a negative number, then we just set it to zero. Mm, good. So this is what happens in this plotting script for the QCD multijet background. And then after that, we just proceed to actually draw all these different histograms that we want to see. We have, again, lots of drawing options here, and then in the end, we save the plots.
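Editor's sketch: the data-minus-simulation step and the clipping of negative bins can be illustrated with plain Python lists standing in for histograms. This is not the plotting script itself; the function name `estimate_qcd`, the sample labels, and all bin contents are hypothetical.

```python
def estimate_qcd(data_hists, background_hists):
    """Data-driven template: sum the data, subtract the simulated backgrounds.

    Each histogram is a list of bin contents with identical binning.
    Bins that end up negative after the subtraction are set to zero.
    """
    n_bins = len(data_hists[0])
    qcd = [0.0] * n_bins
    for hist in data_hists:            # add every data-taking period
        for i, content in enumerate(hist):
            qcd[i] += content
    for hist in background_hists:      # add with a factor of -1, i.e. subtract
        for i, content in enumerate(hist):
            qcd[i] -= content
    return [max(0.0, content) for content in qcd]


# Hypothetical control-region histograms with three bins each.
data_run_b = [12.0, 30.0, 8.0]
data_run_c = [18.0, 45.0, 4.0]
backgrounds = {
    "W+jets": [5.0, 10.0, 9.0],
    "ttbar": [2.0, 4.0, 3.0],
    "Z->ll": [1.0, 6.0, 2.0],
    "Z->tautau": [2.0, 15.0, 1.0],
}

qcd = estimate_qcd([data_run_b, data_run_c], backgrounds.values())
```

In this toy input the last bin would come out at minus three events before clipping, which is the situation the zero floor is meant to catch.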

Unknown Speaker  39:17
Hmm,

Unknown Speaker  39:18
good. So that's how we get the control region C, and that's what we do with it. Are there any comments or questions now before we move forward?

Unknown Speaker  39:34
Yes, I have a short question.

Unknown Speaker  39:38
Will this change much in the case that you cannot simulate

Unknown Speaker  39:42
any background?

Unknown Speaker  39:45
I guess not; basically, you can just define a control region as a region where you have no signal, and just go with all the data.

Unknown Speaker  39:55
Yeah.

Unknown Speaker  39:58
Yeah, I think in principle, in this kind of situation where you cannot simulate anything, as long as you have a simulation of your signal, you can determine what is a good signal region, and maybe somehow what is a control region that is free of signal, that's true. In practice, it's very rare that we wouldn't be able to simulate any background, at least, let's say, with standard proton-proton physics at the LHC. There are always the few Standard Model processes which are the dominating ones, and which one is the largest then depends a bit on the selections that you do, etc. But for all these most common processes, you know, we have the simulations. So yeah, that would be a really, really exotic analysis, I think.

Transcribed by https://otter.ai