Unknown Speaker 0:30 What do you think, have we got everybody? Yep. Okay. So welcome back, everybody, thank you. For those of you who have stuck through these three days of going through two petabytes of open data in one sense or another, I'm particularly excited for this last session that Sezen is going to present to us. Sezen is a researcher at Kyungpook National University, based at CERN. She may not remember, but I actually attended one of the very early meetings about ADL, because I found it so interesting. With the open data effort, there was a real concern about theorists, or anybody for that matter, who might accidentally find something, and we had had coffee discussions about how you document cut flows. I remember attending that first session and being really impressed by the work and seeing the potential. So I'm very excited to hear this and see it in this context; I think this could be a real efficiency boost for all of us, not just people working with open data. And at this point, I'll hand it over to Sezen for the last tutorial session of the workshop. Unknown Speaker 1:48 Great, thank you very much. First of all, let me greet you for a few seconds personally. And now let me turn off my camera and start the sharing. While I'm doing that, I would again like to thank the organizers very much for allowing us to present ADL as a potential tool for open data. So what I'm going to do right now: I'm first going to show a few slides to explain what ADL is and why it might be relevant for open data, and then we're going to show a short demo of how to run ADL with an open data analysis. Okay, so let me... okay, can you see a full screen now? Unknown Speaker 2:47 Yes, I can. Yes, yep. Great. Unknown Speaker 2:50 So, first of all, why this demo? In this workshop, we have seen how to perform a data analysis within an analysis framework.
And the framework is written in C++ and Python code. In this framework, and I have looked into it quite carefully, physics content and technical operations coexist, and they are handled together. This is very nice, of course, also because it was an easy analysis, an easy exercise. But now the question is: could there be an alternative way to do this? An alternative way which potentially allows for a more direct interaction with data, and which decouples physics information from purely technical tasks. This is where ADL comes into place. Now, a little bit of terminology; I guess most of you are already familiar with this. As I say, the way we are doing things at the LHC is through analysis frameworks, which use general-purpose languages. By this I mean languages like Fortran, C++, or Python, which are broadly applicable across many application domains, so they're used for solving general problems. But then there is the other side, which we can attempt now and beyond: we can try to use domain-specific languages, that is, languages specialized to our particular application domain, which in our case is of course high energy physics analysis. And if we take such a domain-specific language and write it within some general-purpose language, it would be an embedded DSL; that's another way. And then there's another concept, which is called declarativeness in a language. This means we have a language that expresses the logic of a computation without explicitly describing its control flow. So you tell the computer what to do, but you don't tell the computer how to do it. Okay, so this brings us to the definition of analysis description languages. So what is an analysis description language, or ADL, in the way that we'd like to call it, or the language that we are going to present here?
It is a domain-specific and declarative language, capable of describing the physics content of an LHC analysis in a standard and unambiguous way. That's really the definition. And it's designed for use by anyone with interest or knowledge in physics: this could of course be experimentalists, phenomenologists, theorists, or even other enthusiasts, like students, people who don't know much about the technical tasks. Since this is a workshop for theorists, my theory colleagues would probably be very familiar with earlier efforts, or languages that we already use, like the SUSY Les Houches Accord or the Les Houches Event Accord. The language that we're going to present was inspired directly by these accords. And actually, we started thinking about it at Les Houches, so its origins have also been documented in various Les Houches proceedings. Now, when we have an analysis description language, what would the language describe? There are several components of an analysis, and when we think of an analysis, it's a very, very wide thing. We could have event processing, or histogramming and visualization. We can have the fitting and statistical inference part of an analysis. And we can also have workflow management, which covers and connects everything in an analysis. What we are focusing on here, at least at this first step, is mainly event processing, and I'll show what I mean by that in the next slide. But apart from event processing, Unknown Speaker 7:22 we have also started to work on how to incorporate histogramming and visualization, because if you want a real analysis tool, or if you want to perform an analysis even with a domain-specific language, we need to think about how to histogram. Okay, so what should be the scope?
Again, we said that our first focus should be event processing, and what do we mean by that? We start with the input event content. This means whatever physics information we have in our data file: objects, attributes of the objects, triggers, whatever information we have as input. And our end point is event selection. So our first focus is the part between these two things, and this would include simple and/or composite object definitions, event variable definitions, and event selection. That is our main scope. But as I said, one can extend this a lot, because an analysis covers much more than these things. Since there we can be very creative, and we can have many diverse ways of background estimation, etc., that would be a lot harder to standardize. So we start with the simple event processing part first. Okay, so these were some generic ideas about the language. And as I said, starting from some discussions at the Les Houches workshops, we came up with a language. And actually, there was another effort called CutLang, which my colleague Gokhan had started himself, and in the end we tried to take the best of all worlds and combine it into a workable language that can express analyses and that one can even use to run over data. We are currently calling this language, very simply, ADL. And what does ADL consist of? It's very simple. The idea is that we have a plain-text, human-readable file, which describes the analysis using an easy-to-read domain-specific language with clear syntax rules. This file contains only the physics information and nothing else. Then, accompanying that file, we have a library of self-contained functions.
These self-contained functions could encapsulate variables that are non-trivial to express; you know, we could have very complex kinematic variables, Unknown Speaker 10:18 which we cannot describe in a simple human-readable way. But that is fine, because with this language we just aim to organize the physics information in a clear and systematized way. So, speaking of these functions, one can of course have several machine-learning-based functions, numerical functions, efficiency functions, etc., any kind of numerical or hard-to-express analytical function. Okay. So that's really it: we have one human-readable file and then a library of accompanying self-encapsulated functions, and that is enough to run the analysis, or express the analysis. Now, the point is, what should this human-readable file look like? As I said, this whole effort was inspired by the SUSY Les Houches Accord and the Les Houches Event Accord, and these all have a block-like structure, which seems to be very useful and seems to work fine. So we started from that, and so far it is working fine. What we are doing here is separating the different components of an analysis into blocks. These components are object selections, variables, or event selections. And in these blocks we have a keyword-value kind of structure, where keywords specify analysis concepts and operations. Then within these blocks we have, of course, a certain syntax, which we try to make as self-describing as possible. The syntax includes mathematical and logical operations, comparison and optimization operators, reducers, four-vector algebra, and several very well-known HEP functions like dPhi, dR, etc.
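To make those syntax elements concrete, here are a few illustrative ADL-style expressions. This is a sketch only: the object and attribute names (jet, goodMuons, MET, etc.) are made up for illustration and are not taken from the actual analysis file.

```adl
# comparison and logical operations on object attributes
select pT(jet) > 30 AND abs(eta(jet)) < 2.4

# reducers acting on whole collections
select Size(goodJets) >= 2
select max( pT(goodMuons) ) > 20

# four-vector algebra and common HEP functions
select m( goodMuons[0] + goodTaus[0] ) > 40
select dR( goodMuons[0], goodTaus[0] ) > 0.5
select dPhi( goodMuons[0], MET ) < 2.0
```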
So yeah, this is the language in a nutshell, the idea. And we have used the language to write several LHC analyses as well; we have about 15 of these analyses written and saved in our GitHub repository, which you can check. Of course, the more analyses we write, the more we try to improve the language. Okay, so in the next two slides I try to summarize the ADL syntax, but I'm not even going to show it; I think it would be better for us to look at the example analysis. But as I say, these slides are here for reference. Now, this was about the language. How do we use this language to perform an analysis? Of course, we need a framework, or some software, to parse and read an analysis written in this kind of form. And we do already have tools for doing this. What do we start with? We start with an ADL file. We also take the self-contained functions for complex variables. And then we take the events, obviously, on which we want to run the analysis. Then there are two ways of doing this. Either we parse the file and convert it into general-purpose language code, like C++ or Python; then we compile the code and run it on events to produce output, whatever that may be: cut flows, histograms, selected event skim files, whatever other results you can imagine. That is one way. The second way is to bypass the code-writing part and directly interpret the language and run it on events. That is the runtime interpreter approach taken by CutLang. So yes, these are the two different ways of running, and as I say, we have tools for both approaches. The first approach, as I said, is a transpiler approach: one can convert ADL to C++ code, and then compile it and run it.
So this is a piece of software, a framework we developed with Harrison. Unknown Speaker 15:00 This framework is executed within the so-called TNM generic ntuple analysis framework, and I'll mention that in an interlude slide shortly. This framework can work with any simple ntuple format, and what it does is to automatically incorporate the event input into the C++ code. So what happens is: we input an ADL file, and we input a ROOT file with a certain ntuple format; then we run a script, and we get automatically generated analysis code. We can then compile this code and run it on any kind of events. So this works, but it was written, let's say, to put it politely, in an ad hoc way, not using formal grammar rules, etc. So right now we have kind of paused it, and we would like to rewrite it using a formal grammar. But the nice thing about this approach is that it really automates the input stage: one can easily incorporate any kind of ntuple format, and one can also very easily incorporate any external functions. Okay, I would like to spend maybe a minute or two on the ntuple maker itself. Apart from the analyzer, we also developed, around 10 years ago, an automated way of making ntuples. This is not directly related to ADL, but I just wanted to include this information. So we have a way of creating ntuples with customized content from data in the CMS EDM format. And this setup actually works directly with the open data setup, with ROOT 5; it works in Docker; we tried it, for example, and it works. And it's very nice. What one can do is, using a GUI, extract information from the EDM format and put it into an ntuple. The GUI is there; you can see a little screenshot of it here. You can handpick whatever content you want from an EDM file.
And you can automatically create a CMSSW-style cfg file, a config file, which you can run on the EDM files to generate an ntuple. And while doing this, the ntuple maker also automatically generates an empty analysis code based on that ntuple content. So this second part is where we have, or can have, a link with ADL. For example, if you want to do an open data analysis: in the Higgs to tau tau analysis, the nanoAOD-like files only include certain objects, like the muons. What if I want to do an analysis with other objects, say electrons? How do I add them? So here is a very practical way of very quickly configuring and designing an ntuple content for yourself, then generating an empty analysis code, or maybe also adding ADL and directly generating analysis code on whatever customized content. Anyway, also about TNM: here is the link to an open data tutorial, which people are welcome to try. Okay, now coming to the code that I'm going to demonstrate today: it is CutLang, the runtime interpreter, which is developed by Gokhan and partially myself and some other colleagues. As I said, in the runtime interpreter approach there is direct parsing and interpretation; there is no compilation. CutLang runs directly on an ADL file and on ROOT files. It's written in C++, it works on any Unix environment, and it's based on ROOT; it only requires ROOT. In CutLang we use Lex and Yacc to do the parsing automatically; these are formal tools for writing grammars, and the dictionaries are automatically generated. Then, of course, around this parser we also have a framework which is built around the interpreter. This framework again reads events from ROOT files. Now, the reading part is a little bit less automated compared to the transpiler approach.
So right now we need to configure, partially by hand, Unknown Speaker 20:00 the input files that we need to read, but that can be done, let's say, more or less easily, and all event types here are converted into some predefined particle object types. Since CutLang is a runtime interpreter, we need to have everything already there, so lots of internal functions, etc. CutLang has a lot of internal functions already, and it is more enriched in language content compared to the transpiler. And it gives an output. Okay, so this is the little demonstration that I'm going to show. Briefly, before that, let me just try to say a few words about why ADL for open data; after all, there are obviously already many analysis frameworks out there. The first thing: ADL aims to decouple physics analysis logic and algorithms from software frameworks, and so it allows one to focus on analysis design rather than the complexities of a software framework. By this, it tries to make things easy: it doesn't require any software or system-level expertise to run an analysis. Therefore it is, or aims to be, really easy to use for everyone, and therefore to democratize analysis design. The second thing is, of course, that ADL tries to be very much in the spirit of long-term preservation, which is aligned with the spirit of open data, because we want to decouple the physics information and we want to make it easily communicable and shareable, etc. That is why we think this is relevant: it's not that the existing frameworks are not good, but this approach, we think, is very much aligned with the open data idea. Okay, so this is the summary. Now I'm going to demonstrate, very briefly, setting up and running the open data Higgs to tau tau analysis, but if people would like to try, we're very happy to help with writing and running analyses with ADL.
Unknown Speaker 22:39 Alright, so Unknown Speaker 22:42 now, Unknown Speaker 22:45 I am going to... I hope you can see my terminal here. Unknown Speaker 22:51 Yes. Unknown Speaker 22:53 Okay. Great. So I'm going to set up everything from scratch. We are going to run the Higgs to tau tau analysis in CutLang. So let me just Unknown Speaker 23:12 do Unknown Speaker 23:20 okay. Unknown Speaker 23:31 Okay. Unknown Speaker 23:35 So, while this is compiling, it's just going to take a minute, I would like to Unknown Speaker 23:47 show you the GitHub repository where we are keeping the various LHC analyses that we have implemented with ADL so far. You can see the analyses by ID. These are mostly CMS analyses, because I'm from CMS, obviously, and here is our open data analysis. So it's also on Git right now, and this is the ADL file, let's say, for the analysis. Okay, hopefully the compilation finishes; I'm just going to Unknown Speaker 24:32 okay, it's almost there. Unknown Speaker 24:40 In the meanwhile, Unknown Speaker 24:43 let me Unknown Speaker 24:51 okay, so as I said, I'm running this on my Mac, macOS Mojave, or however it's pronounced. It's my personal computer; you don't need any special setup or anything like that to run this. So right now CutLang has compiled; you can see here the main CutLang directory. Now I go into runs, the directory where we run the thing, and now I've got the ADL file for the open data analysis here. Let's first take a look into this file. This is the name of the file. As I tried to explain, the file is built of blocks with certain purposes. We first start with a small info block, which gives all sorts of information about our analysis. We actually don't really use this information anywhere, but, you know, in the future, if we do some sort of a database thing, it would be easy to collect that information.
Now we can go directly to the objects. I guess everyone is already quite familiar with the analysis. Unknown Speaker 26:18 So, Unknown Speaker 26:20 now we have the selection of muons and taus and jets here as the simple objects. We are defining an object called good muons, and we are starting with the muons. Then we are literally doing a selection based on the various attributes of the muons, like the pT and eta. For everything except the four-vector, we really use the names that you see in the ntuple. Similarly, for the taus, we start with the tau, and then we do the selections that are actually done in the analysis. As you can see, with these objects we can derive new objects in a pretty straightforward way, which is also self-documenting. And then we do the same thing for the jets. Now comes something that is a little bit less trivial-looking, let's say, which is combining the muons and taus in order to make a Higgs. Combinations, frankly, are the hardest thing to express in such a language, because one needs to be very careful, and this is still work in progress in a sense, but still there are a lot of things we are already able to do. What we are doing here is combining the good muons and good taus that we have already selected up here, defined with these selections, and then we are putting a DeltaR cut on top of this combination. And if you all remember, apart from the DeltaR cut, there were two other requirements in the Higgs to tau tau analysis: we wanted to select the pair with the muon that has the maximum pT, and we wanted to select the tau that has the minimum isolation. These two lines are doing this. The first expression here, maximum of pT, obviously gives the value of the maximum pT in the good muons collection.
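As a rough sketch of what such object blocks look like: the attribute names and cut values below are illustrative, not copied from the actual open data ADL file.

```adl
object goodMuons
  take Muon                     # input collection, named as in the ntuple
  select pT(Muon) > 17          # kinematic cuts on muon attributes
  select abs(eta(Muon)) < 2.1
  select relIso(Muon) < 0.1     # illustrative isolation attribute

object goodTaus
  take Tau
  select pT(Tau) > 20
  select abs(eta(Tau)) < 2.3
```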
And then the second part gives the pT of the muon that we are looking at. And this sign here is a sign that we introduced in the ADL notation to do optimization; it means "closest to zero". So this way we are optimizing, and this is what the minus indices are representing. Again, these minus indices could look a little bit complex, but all they really say is that we are optimizing something and we don't yet know which one to take. Anyway, what we are doing here is to take the muon with the pT that is closest to the maximum, and we are doing the same for the taus. This is how we do the Higgs selection; it's literally four lines of ADL-style expressions. So these are all our objects, and then next come... Unknown Speaker 30:02 Can I ask a quick question? I may have gotten distracted when you were saying that. So that minus one is just for the muons, and you said it's minus one because you don't know Unknown Speaker 30:14 yet which one; and then the minus two for the taus: why is it two instead of minus one? Yeah. It's a little bit of a technical issue. It has to do with the interpreter's way of interpreting. Okay. So, yeah, you just have to define a different number. Unknown Speaker 30:38 Okay, thank you, Unknown Speaker 30:39 and we can come back to it later if you want. Okay, okay. So yeah, this is that. I mean, again, the minuses appear only when we have to do an optimization, when we don't know which one we are going to take. It's just to represent the object that is going to be selected in the end. All right, then come the definitions. The keyword define, I hope, speaks for itself: it's just to define aliases, or shorthand names, for things, for rather longer expressions.
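Putting the combination and the two optimization requirements together, the Higgs candidate block looks roughly like the following sketch. The names are illustrative; the -1/-2 placeholder indices and the ~= ("closest to") operator are the ones discussed above.

```adl
# Combine selected muons and taus into Higgs candidates.
# goodMuons[-1] / goodTaus[-2] are placeholders for "the muon / the tau
# that the optimization will eventually pick"; ~= 0 means "closest to zero".
object higgsCand
  take comb( goodMuons[-1], goodTaus[-2] )
  select dR( goodMuons[-1], goodTaus[-2] ) > 0.5           # DeltaR cut
  select pT( goodMuons[-1] ) - max( pT(goodMuons) ) ~= 0   # max-pT muon
  select relIso( goodTaus[-2] ) ~= 0                       # min-isolation tau
```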
So the first two: since in our case we are selecting a best Higgs pair, we just want to access the muon and the tau in this pair. That's how we're doing it, with a function daughter that we define. Of course, this can be done in a better way, but it works for the time being. And then the rest are just standard definitions. For example, if you want an even shorter hand, you can define the mu-tau pair as the zeroth combination; then you can define DeltaRs, you can define MTs for the tau and the muon. These were all variables defined in the analysis. And you can define a JJ pair, which was again defined in the original analysis, and you can take its mass and pT and delta eta, etc. You can even define something like the weight variable here. Of course, the expressions on the right-hand side of the equals sign can all be used directly, but writing them in shorthand can also be helpful. So one can define all sorts of things, and as you see, we are able to express all the simple mathematical expressions, including square roots, powers, etc. Okay, now that we are done with defining event variables, we have the event selection. This analysis is very easy in that regard; there is not that much of an event selection. One interesting thing is that we can also define event weighting, as you can see here. Of course, in this case it's a cross section times luminosity event weight, but we can do much more complex stuff. For example, if you have a trigger efficiency function, or if you are using any sort of efficiency that is derived somehow, you are able to replace this number with a function to apply an event weight. So we are able to do this.
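A sketch of such a define block, with hypothetical names (the daughter accessor, METLV, and the cut values are made up for illustration), might be:

```adl
# shorthand aliases for longer expressions
define mu      = daughter( higgsCand[0], 0 )   # hypothetical daughter accessor
define tau     = daughter( higgsCand[0], 1 )
define dRmutau = dR( mu, tau )
define MTmu    = sqrt( 2 * pT(mu) * MET * (1 - cos( dPhi(mu, METLV) )) )
define jetPair = goodJets[0] + goodJets[1]     # four-vector sum of two jets
define mJJ     = m( jetPair )
```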
We have some more complex examples where we can even read, let's say, b-tagging efficiencies or whatever from tables written in this file, and directly apply the weights to events; so we can do pretty advanced stuff too. Okay, and this is literally a trigger cut. And then the rest, I hope, again speaks for itself: we're asking for at least one good muon, or, you know, more than zero good muons, more than zero good taus, and then more than zero Higgs, the combined mu-tau object. And then the rest is just defining some histograms. I tried to define all the histograms that are defined in the analysis code, so we have them here. The Higgs to tau tau analysis code also defines some variables based on two jets, and we define them here too, just to demonstrate how one can have multiple selection regions which derive from each other. So here we have a different region, which derives from the baseline region up here. We can easily use the region defined above and then go on defining our histograms. So that's really all; the analysis is written in, hopefully, a simple way in this file. I've already used my half hour, so I will very quickly run this analysis. Unknown Speaker 35:31 So Unknown Speaker 35:35 actually, Unknown Speaker 35:38 this is the command that we use, and I already had the file Unknown Speaker 35:44 here. Unknown Speaker 35:51 So now I'm running this for 100,000 events. Unknown Speaker 35:57 And there we go, the analysis has run. It outputs a simple cut flow table, a table of all the operations that we have; it has the efficiencies of each step, and it has the event counts, of course, after the weighting. So we have all this information here. And as you remember, we had two regions: the baseline region here, and then the two-jets region, which derives from the baseline region. And so that's really it.
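The event selection part just walked through, the weight, the trigger cut, the object count cuts, the histograms, and a derived two-jet region, can be sketched as follows. Again, the names and values are illustrative, not the exact contents of the demo file.

```adl
region baseline
  weight lumixsec 0.0043          # illustrative cross section x luminosity weight
  select trigger == 1             # trigger cut
  select Size(goodMuons) > 0      # at least one good muon
  select Size(goodTaus) > 0
  select Size(higgsCand) > 0
  histo hmvis , "visible mass (GeV)", 30, 20, 140, m( mu + tau )

region twojets
  baseline                        # inherit every cut from the baseline region
  select Size(goodJets) >= 2
  histo hmjj , "dijet mass (GeV)", 30, 0, 400, mJJ
```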
And after we run, we now have all the output also in an output ROOT file. Let me show that to you. The name is this: we just add histoOut to the beginning. And then let's take a look at this file with a TBrowser. Okay. Unknown Speaker 37:08 All right, that is not good. Unknown Speaker 37:11 Sorry about that. Unknown Speaker 37:22 Making this bigger by hand is probably better. Yep. So here's our file. And as you can see, there are two directories in the file: a directory is made for each of the selection regions, so there's one for the baseline, and there's another for the two-jet region. And as you can see, we have all the histograms here; they're all in here. And there's one more nice feature which I would like to show. Let's go into this baseline region and do an ls. Unknown Speaker 38:14 When we do an ls, Unknown Speaker 38:17 we can see that the ROOT file also contains the so-called provenance information of the analysis: it also lists the ADL cuts that were used for producing this selection region, as you can see here. Unknown Speaker 38:41 So yeah, this is the idea. I could show a few more things, but since I have already taken a while, I don't want to take more time with the demonstration. I hope this gives an idea of how ADL is written and how we are able to run a more or less standard analysis using CutLang. Unknown Speaker 39:08 Thank you. Unknown Speaker 39:12 Thank you very much, Sezen, and also Harrison, for proposing this demonstration. If there are questions, you can ask them now, and there's also a Mattermost channel, you know, if someone sees this video later. I'm afraid some of the participants of the workshop are probably already thinking about going to bed; it's pretty late for them. But there will be, as I said, this Mattermost channel for questions.
Unknown Speaker 39:48 Yeah, and we are always ready to help. As I say, this demonstration is just to give an idea of what this thing is. Unknown Speaker 39:58 Jesse has a question. Transcribed by https://otter.ai