Unknown Speaker 0:00
Steady recording. Okay.

Unknown Speaker 0:04
Okay, so welcome, everyone. My name is Gabriele. It is also in these slides, which will be shared soon, but my last name is Benelli. I work here at Fermilab at the LPC, and I want to welcome everyone to this CMS Open Data workshop for theorists. Thank you for signing up and for joining us this morning. As you all know, we're having our workshop fully remote, and it's our first full-scale workshop that we're having this way. We have had a lot of events, but this is a full-scale workshop with tutorials and everything that we're doing this way. So we thank you for your patience as we go through this. We still have a welcome to Fermilab that will be given by Liz Sexton-Kennedy, who is the CIO of Fermilab. So she is now working in management, but she has been in CMS for a very, very long time. She helped set up the very tools and software that are being used to do this open data, and she has been involved in so many different aspects of software and computing in CMS. So Liz, can you give us?

Unknown Speaker 1:30
Yes, good morning, everyone. And welcome to the Fermilab LPC's CMS Open Data workshop. It is a topic that's close to my heart. I remember, you know, back in 2011, standing before the collaboration board as the software coordinator, and really pushing

Unknown Speaker 1:53
the

Unknown Speaker 1:55
whole concept. It was all new back then. Of course, the person who really made that all come to fruition was Kati, who's also on the line. And I want to give a shout out to Kati because she has worked so hard on all of this through the years. You know, it was a thing that I felt we had a moral responsibility to do. It took tens of billions of dollars to build the LHC and to collect this data that you're going to be learning how to use. And I really feel, with all that, it's our responsibility to squeeze as much physics out of this data as we possibly can.
And that includes all kinds of, you know, measurements that were maybe not the primary focus, you know, the more intensity measurements, but all kinds of wonderful things are lying in this data that, you know, the collaborators themselves don't have 100% of their time to get to do that squeezing. So I think this is a very important workshop. And I'm glad you're all here, and welcome. And I hope you enjoy it. Thanks.

Unknown Speaker 3:09
Thank you, Liz, also for the welcome from the management. And we now go to the introduction to the workshop by Jesse. Jesse is also connected. And the floor is yours, Jesse.

Unknown Speaker 3:31
Alright MIT. Okay, thank you, everyone. And thank you for joining this workshop. So I'm nominally one of the organizers of this, but really the hard work has been done by my co-organizers, so I'll recognize them in a moment. But I wanted to welcome all of you to the CMS Open Data workshop for theorists and other people interested in the future of public data access, both in collider physics, but also thinking about what might be relevant for other fields.

Unknown Speaker 3:55
So this workshop is of course sponsored by the Fermilab LPC, and just a shout out to Matt and Kati as my co-organizers, Gabriele and Kevin from the LPC, as well as a number of the people who are going to be serving as facilitators and presenters this week. And this is the look of a befuddled theorist looking at equations and plots and detectors, and then figuring out how to actually install the CMS software, and then saying, you know, then what? What do we do once we get to this point? And that's what you're all going to explore this week. And you know, the goal of this workshop is to enable theorists to use real experimental collider data in their research. And this is something that I've benefited from, and I hope all of you will benefit from as well. So just a little bit of a broader perspective.
You know, CMS is really a pioneer in the release of research-grade public collider data. Other experiments have of course released data for more educational purposes, but CMS is trying to make sure that the workflows that were used for actual analyses will be available to the public. There have been five releases, starting in 2014 with the release of half of the 2010 initial LHC data, and four releases after that, with increasing sophistication and increasing information, including, for example, samples relevant for machine learning applications as well as simulated datasets. And with these releases, the broader community, so let's call us the theory collaboration, with an appropriate coffee-based logo. Here's the informal spokesperson of the theory collaboration, one of my heroes, Emmy Noether. And I don't think she ever played with public data; she was more on the mathematical side. But external and internal users, that is, people who are even on the CMS collaboration, are taking advantage of this unique scientific resource. And I've tried to be comprehensive of all the papers that I could find that cite open data usage. I count 13 of them; there may be more. And these red arrows point out when these publications hit the arXiv. And the hope is that there'll be a growth, both in more public data releases from CMS, as well as more analyses. And I think the workshop attendees may be the ones who will be generating the next wave of CMS open data-based results. So I'm going to be giving a colloquium this afternoon, and I wanted to just give you a little bit of a preview of that colloquium as kind of inspiration and motivation. And one of the things that I'm going to emphasize in the colloquium this afternoon is that the CMS open data is a fantastic resource with many exciting applications. And I agree with Liz that we have a moral responsibility to make sure this data is public.
And there are also exciting opportunities for educating future scientists, stress testing archival data strategies, enabling exploratory or proof-of-principle studies, facilitating dialogue between theory and experiment, and of course, researching physics in and beyond the Standard Model. But these are only going to be possible with sustained investment in public data initiatives. And part of my colloquium will be just showing the type of science that one can do when you have public data. Now, I was running out of time when preparing the talk, and I realized that I had a lot of cool things in backup slides, and so I thought I might just spend time here saying what the things in the backup slides are. And what I did, just to give a sense of the scope of what people have been doing with public data, is I took all those 13 papers that I mentioned before, and just picked out one picture from each. And I'll just go very quickly through the types of science that have been done already with the public data, hopefully as inspiration for the types of things that you might want to do with the datasets. So you know, one thing that has happened is Standard Model analyses, and this started with work that I did with Aashish Tripathee, Wei Xue, Andrew Larkoski, and Simone Marzani, where we had a particular way of processing jets. We had a theory idea; that theory idea is represented here by this red curve. And I won't go into the details of what this theory idea was. Needless to say, when we had this idea and we had the opportunity to test it on open data, we jumped at that opportunity. And you can see fantastic agreement between a theory prediction and an experimental analysis. And just to give you a sense of the timescales, you know, the data that we did this analysis on was from 2010, and the release of this public dataset was in 2014, but the theory idea didn't come around until 2015. And so one can kind of think about this for the future.
What if you have a theory idea where the experiment could even be, you know, completely completed? Would you still be able to go back and do an analysis? And we were able to do that analysis, released in 2017. And of course, as we all know, all truth is on Twitter. So this is an important analysis, really, so says Twitter. And this is around the time of people's excitement about R sub K. But at least this one Twitter user thought this was the most interesting paper of the day. There have been subsequent analyses, going into more detail about, for example, doing electroweak benchmarks and comparing the results that you would obtain from open data to results that are published by CMS and ATLAS. And the authors here are actually experimentalists who were in some cases involved even in making these green CMS curves. And they are basically stress testing what type of information is available in the open data: can we really go back in time and reproduce old results? So Standard Model analyses are one thing that you can do with open data. Another thing you can do with open data is searches for physics beyond the Standard Model, and so this afternoon, I'll explain a new physics search that I did with Cari Cesarotti, Yotam Soreq, Matt Strassler, and Wei Xue. And another example of a BSM search is a search for non-standard parity violation by Lester and Schott. And this is an interesting analysis because this is the type of analysis where you have to think outside the box to want to do it. It's non-standard parity violation because this is parity violation that doesn't appear in

Unknown Speaker 10:03
generically in the context of quantum field theory. So you're really going very non-standard, but looking at whether you have a symmetry, essentially left-right symmetry, explored through a novel jet observable.
And one of the quotes that I really like from their work: it's hard to imagine any reason why every possible attempt should not be made to test and retest the fundamental symmetries of nature every time a door opens onto a new energy range. And so while from the perspective of quantum field theory, we'd be surprised if we didn't see a symmetry in this alpha variable between left and right, nevertheless, I think we have an opportunity and an obligation to test and retest and explore and re-explore, especially when we find ourselves at higher collision energies, or with increased sizes of datasets. So BSM searches are one thing that you're able to do with this open dataset. Another thing you can do is machine learning studies. And one of the things that comes with the open data is simulated datasets. And the simulated datasets allow you to stress test one's analysis ideas, but with labelled data, because it's coming from simulation. And so there have been studies of end-to-end classification, anomaly detection, testing out computer vision techniques, testing out kind of graph-type networks or interaction networks. And some of these are just based on using the simulated datasets; some of these are based on real datasets. I think this anomaly detection work does a ttbar analysis using the real dataset as a cross check. And with the rise of machine learning, there's a rise of a need for samples that the whole community can use and benchmark against each other. And so I was excited when the CMS Open Data released this kind of more machine learning focused dataset, since that would give us common datasets for comparison. And let me just take this opportunity to make a shameless plug. I have another hat that I'm wearing as director of a new National Science Foundation AI institute, the Institute for Artificial Intelligence and Fundamental Interactions.
And if people are interested in using open data in the context of machine learning studies, there's actually a postdoctoral fellowship opportunity with an October 20 deadline that you can click here to see if you're interested in potentially working with us. So that's my shameless plug. Coming back to open data, there are many more things that you can do: studies of the underlying event, studies of hardware acceleration for certain calculations, and, something I'll talk about this afternoon, an event space geometry study I did with Patrick Komiske, Radha Mastandrea, Eric Metodiev, and Preksha Naik. And if anyone here, or anyone you know, has done a CMS open data study that I haven't talked about, please let me know. I'm trying to make sure that we can build up a community and support each other, if by nothing else than citations, to emphasize the value of this dataset. So let me just wrap up with a few more thoughts. And just go back a little bit in the past, to emphasize the types of opportunities that you might face and also some of the challenges you might face, going back to that other 27 kilometer circular collider, namely LEP, of course the precursor to the LHC in the CERN tunnel. And I think there's a really instructive case study that can tell us what we're facing and the need to have a community of people scrutinizing these datasets. So this is an analysis that was done by Jennifer Kile and Julian von Wimmersperg-Toeller, where they were analyzing archival data from ALEPH. So ALEPH is one of the four LEP experiments, and they did a good job of archiving their dataset for future use. And there are various folks, including myself, who are trying to make sure that this dataset actually can be as easy to use as the CMS open data, and hopefully even easier. And what they found in this archival analysis was a puzzle in quad-jet kinematics. They were not trying to find new physics; they were trying to revisit

Unknown Speaker 14:06
Z-pole data.
But in the process of doing that, they were comparing their results to Monte Carlo. And they found that if they isolated a particular phase space region, and isolated further, they found a feature, and this is a statistically significant feature, and one that they couldn't explain easily in terms of a detector effect, and couldn't explain easily in terms of a modeling effect. It didn't have a BSM explanation either. And it's a puzzle. And it's a puzzle that right now, to my knowledge, has only had two sets of eyeballs looking at it, in part because this dataset isn't so easy to access. And so even just cross checking something like this will be important, and then highlighting, you know, cases where there may be fluctuations or maybe a fluke. So we need to understand: are these signals for physics beyond the Standard Model? Are these effects that we don't understand? Are they just fluctuations that would go away if you combined all the datasets from the four LEP experiments? And again, for me, an inspiring quote from them: whether the excesses described here are ultimately explained by QCD or physics beyond the Standard Model, our results demonstrate the lasting utility of the archived LEP data. And I think this is true also for the CMS Open Data studies that you will be doing: no matter what you find, they're going to have lasting value for pushing the field forward. So, my last slide. You know, I've been doing some of this since 2015, 2016, 2017. And what I've realized is that data preservation and outside analyses require significant resources. It requires people like you here at this workshop: your time, your ideas, and eventually money. And I think this investment is worth it. I think the work that the CMS open data team does is extremely valuable, and I hope it continues.
And let me just conclude by saying thank you to the organizers, the teachers, the facilitators, and especially the participants that are all here today, for investing your time, your ideas, and yourselves in this workshop.

Unknown Speaker 16:12
So thank you.

Unknown Speaker 16:17
Thank you. Thank you so much, Jesse. I know that you have a commitment in a couple of minutes. If people have questions, Jesse is going to be around after, but he has to kind of go right now. And we invite everyone to that colloquium, which is going to be at 4pm Fermilab time, so Central Daylight Time. For some of you it may be late or early, but yeah, we hope that you guys can join. So you got a preview of that. Thank you so much, Jesse, for the intro. Yeah, yeah, thank

Unknown Speaker 16:51
you very much. And my apologies that I have to go teach for a few hours, but I will return.

Unknown Speaker 16:57
Yep. Okay, so I think that then I can just share a few slides that are mostly logistics type of slides. One second.

Unknown Speaker 17:14
Okay.

Unknown Speaker 17:18
So I hope you guys can see my slides. So basically, as you all know, as we said, this is going to be a fully remote workshop, and just for the fun of it I put the outline of what I would normally be covering in a logistics slide: the access policy, the badges, all the things that you have to worry about coming into the lab, the networking, getting your Wi-Fi going, coffee breaks, floor maps, dinner, car sharing, all of this logistical stuff that we don't have to take care of. And it's kind of sad that we cannot be having coffee together and get to know each other that way, but we have set up something, as Kati, who will follow this presentation, will show, for people to share about themselves and get to know each other a little more. So in the outline, really, the tours are also gone, and we cannot show you D-Zero or the neutrino facilities here.
But the things that are left are the ways of communication that we will use for the workshop, and, you know, getting to know the people, and the connection to the LPC. And so let me actually start from that, the connection to the LPC. So I don't know if everyone knows about it, but LPC stands for LHC Physics Center, and it is a hub that is set here at Fermilab for all of US CMS. So we are in Wilson Hall, which is the high-rise building here at Fermilab, and we have one and a half floors, and we have, you know, about 100 people around at any time, but we have a lot more people that come and go and use our facilities. So you can check out our bulletin; that is something I encourage all of you to do. And on that bulletin there is also a link to add our public event calendar to your Google calendar, so that you get to know about events whether or not you are in CMS. Many of our events are public; not all of them, of course. And then there is this link that I have here about visiting the LPC, where you can find information if you actually were to come, or wanted to just get in contact with us about coming and spending some time here. There are different things that Fermilab offers for people to be visiting. For CMS collaborators, of course, we have specific programs for people to spend significant time here, with travel funds as well. So there's this link to the programs as well. But of course, there's the link about the current situation: right now the access is restricted. So we are all working from home, and only essential workers, people who have to do with accelerators or, you know, infrastructure, are going to the lab. So hopefully, you know, in the future, next year sometime, you can come and visit us, and it would be great. So at the end of the workshop, there's going to be a feedback survey, and you can also let us know if you want to be notified of future public LPC events.
And of course, if you are a CMS collaborator, there are also links where you can subscribe to some mailing lists that are designed for being part of it. So, switching to the modes of communication. Being in a remote setting, for efficient communication we are using Mattermost, which is an open source live chat program that is supported at CERN. So the link that you can see over here is the one to join the team, if you didn't yet. And if you didn't, please do so now. So click that link. That link will require you to log in using your CERN SSO credentials; even if you have a lightweight account, that would work. Just know that sometimes when you click that, since it will try to open a webpage, it may send you to the SSO, and then when you get there, it just sits somewhere in limbo and does not take you to our team. So you can go back to the link and click it again.

Unknown Speaker 21:47
The recommendation that we have for everyone is to actually download the desktop app and to make sure that you have the latest version installed. To do that, once you are in that web interface, there is a little menu next to your name after you're logged in. When you click it, you can scroll down towards the bottom, where you can log out, and you will see that there is a link, Download Apps. When you click that, you can download the one appropriate for your platform, whatever you're using. And after you install the app, you'll need to point it to the CERN Mattermost server, and I put the link here on this slide. When you join the team, you will automatically be added to two public channels, Town Square and Off-Topic. But please be aware that you should browse the public channels that are available, and you should really join them all. I mean, the first four here were the ones that some of you already used for doing the pre-exercises.
And then there are two that were created for today's parts of the tutorial, and so you can click on those links also to join. But you should really get into the habit of looking into how you browse the public channels in our team, because we will probably add more channels later. So finally, you know, in the next presentation, Kati will go into more detail about the whole schedule and how the whole workshop has been organized. But here's the link to the Indico agenda. If you're here, you probably already have it. And as you have probably seen already, we're planning on having morning sessions that are going to be synchronous. So they will include presentations and lectures, and they will be recorded. And then there will be, in the afternoons, asynchronous work sessions, where participants can work using Mattermost and exchange information that way. And, you know, that way, you can do this at your own pace. But it's very important to foster this self-support

Unknown Speaker 24:02
kind of situation using Mattermost. So you ask questions there, and if you already solved the problem, you can help somebody else solve it as well. There's also a link that I'm posting here to the CERN Open Data forum, where many of the questions that you may encounter by going through the workshop and doing some of these things for the first time have probably already been encountered by other people. So there's a forum, the link is here, where you can look for the answer to your question or to your problem before asking. But if it is not there, go ahead and ask, so hopefully we can get it there. In general, you know, the reason why we have Mattermost is because we want those questions to come. And in general,
when asking for support, it is always very important that we try to be as detailed as possible. If you're having a problem, you would post it, you know, like an issue: if you have an error or a warning, you would try to show that information, so that people can help you solve it. So we hope that communication will be flowing. Then one last slide about the people, so that you can refer to this to find the names of those who, on Mattermost, will try to give support as we go through. Many of these people will be actually lecturing or giving instruction directly; others have prepared some of the exercises or tested them, and will try to be connected to help you when things get it. So, in general with Mattermost, like with any live chat, if you start getting into problems that are, let's say, very technical or very specific to your platform, or to your laptop, or something like this, still don't hesitate to ask. Once the problem is identified, some facilitator may direct message you and take that out of the channel until it gets solved, so that you can feel more comfortable sharing more information and not, you know, taking the whole bandwidth. But please, you know, we want to lower the threshold for you to interact, because interaction is how we all learn. And we want to make sure that we give you support, so that people can interact also one on one. And I think that having the names here helps also to give a reference of the people that have helped with this. So with this, let's have a great CMS Open Data workshop. And so we'll get to Kati, to get an introduction.