ServiceX working meeting

Benjamin Galewsky (Univ. Illinois at Urbana Champaign (US)) , Marc Gabriel Weinberg (University of Chicago (US))
Service-X Call July 14, 2020


Alexx P, Andrew E, Ben G, Chris W, Gordon W, Ilija V, Kevin P, Marc W, Mark N, Peter O, Rob G

Andrew has been granted write permission to the servicex-frontend repository, so he will be able to assist with development there.
User sign-up has been greatly improved (by Andrew) to include many more relevant fields.
Marc has removed all 'ATLASisms' from the transformer. 
Ben is working on a change to ensure that the specified transformer image exists before trying to start it up. He wants to do this before renaming the transformer repository. 
Gordon expressed surprise at a renaming, as he had thought each transformer would be kept in its own repository, and he expects each experiment will want to have control of their own transformer code. 
Peter asked what progress there is toward user isolation. Ben answered that it is not on the roadmap for v1.0 at all. Peter is concerned about aspects of the API like the route which provides a listing of all requests from all users. He thinks that at least the common case is that users will want to query only their own requests. Gordon was also interested in having such a capability for debugging. 
Gordon asked whether the API specification in the merge request for error reporting is final, since he wants to implement using this in the frontend library. Ben said that unless someone has a reason to change it, he considers it finished. 
Gordon has been collecting error messages that he has gotten (from grid infrastructure upstream of servicex). Service-X can't generally fix these, but it may be possible to react to and work around some of them. He has observed that Rucio's catalog is frequently out of date, and updates seem to be rather slow. He has also observed configuration errors at sites, where files were accessible by some protocols but not others, and a number of transient permission errors. Currently, when a file is cataloged as existing at multiple locations, the DID-Finder picks just one replica, which makes it impossible to fall back to another if the first is unavailable.
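The replica fallback Gordon describes could look roughly like the sketch below. It assumes the DID-Finder passes along every known replica URL rather than a single one. The function name `open_first_available`, the `opener` callable, and the example URLs are all hypothetical and are not part of Service-X.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

T = TypeVar("T")


class AllReplicasFailed(Exception):
    """Raised when no replica in the list could be opened."""


def open_first_available(
    replicas: Iterable[str],
    opener: Callable[[str], T],
) -> Tuple[str, T]:
    """Try each replica in order and return the first one that opens.

    `opener` is a hypothetical callable standing in for whatever the
    transformer uses to open a file; it should raise on failure.
    """
    errors: List[str] = []
    for url in replicas:
        try:
            return url, opener(url)
        except Exception as exc:  # a real client would catch narrower errors
            errors.append(f"{url}: {exc}")
    raise AllReplicasFailed("; ".join(errors) or "no replicas given")


# Illustration with a fake opener: the first site rejects the request.
def fake_opener(url: str) -> str:
    if "site-a" in url:
        raise OSError("permission denied")
    return f"handle for {url}"


used_url, handle = open_first_available(
    [
        "root://site-a.example//data/file.root",
        "root://site-b.example//data/file.root",
    ],
    fake_opener,
)
print(used_url)
```

Collecting the per-replica errors, rather than discarding them, would also give the automated bad-file reporting discussed in this meeting something to work from.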
Kevin mentioned that CMS has experienced similar issues with files which are supposed to be available, but xrootd fails to serve them. He thinks that it is important to keep pushing the xrootd developers to fix this more completely. 
Gordon has been reporting bad files to ATLAS one at a time as he discovers them; since service-x is likely to discover many such files, reporting them automatically could be useful.
Kevin has had a less positive experience in CMS than Gordon has when reporting problem files, but he thinks that automated reporting might help push sites to improve.
Ilija noted that ATLAS does not technically require xrootd, although it is needed and generally provided in practice. He has previously automated testing of xrootd services, and experienced month-long turnaround times for fixes. He also noted that the frequency of files having more than one replica is decreasing with time, so the value of switching replicas will also diminish.
Rucio provides meta-files for datasets, but these can become rather large in the case that a dataset exists in many places. 
Peter asked whether an xroot redirector implements fail-over to different replicas, or whether it must be implemented in clients. Ilija said that no redirectors are in use here, so any necessary logic must be in the client. 
Ben asked whether it would be feasible to require any dataset to be transformed to be pre-staged to MWT2. 
Peter thought that such a requirement is not feasible, and would eliminate a major part of the value of service-x. 
Rob commented that remote access has traditionally been de-emphasized in ATLAS. He thinks Gordon's points about the complexity are important to address. However, he does not want dealing with grid data transfers to delay developing the rest of the service. He thought that to demonstrate v1.0 it may be necessary to rely heavily on pre-staged datasets.
Peter thought that might be feasible for v1.0, but problematic in the long term, for example because only U.S. users can store files at MWT2. 
Gordon thought that much of this is a larger policy discussion; maybe a small group (Marc, Ben, Rob) can brainstorm how this should work in the future. 
Rob asked Gordon how much disk is needed to support ServiceX users for six months. 
Gordon and Peter thought that about 10 TB of MC data is typical per-analysis. Gordon thought that perhaps four or five analyses might use service-x in the next six months, but he and others felt quite uncertain about this. A scale of 50 TB seemed relevant for the short term. 
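The 50 TB figure follows from multiplying the two estimates given in the discussion. A quick check, using only the numbers quoted above (both of which the participants considered uncertain):

```python
# Figures quoted in the discussion; both are rough estimates.
TB_PER_ANALYSIS = 10
ANALYSES_LOW, ANALYSES_HIGH = 4, 5

low_tb = TB_PER_ANALYSIS * ANALYSES_LOW
high_tb = TB_PER_ANALYSIS * ANALYSES_HIGH
print(f"Estimated disk need over six months: {low_tb} to {high_tb} TB")
```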
Rob suggested that for the longer term it may be possible to propose changes to adjacent systems which will make this all work more smoothly. 

TreeMaker transformer for CMS

Marc and Ben have been working with Alexx on this. 
The goals are to read two different CMS data formats, one of which is complex, and one of which is simpler. 
TreeMaker is a tool to read the more complex format, and has now been successfully packaged in a Docker image. 
A CMS instance of Service-X is up and running, with full features. 
The TreeMaker transformer has successfully read lists of input files and sent its transformed data to MinIO.
The next step will be to enable the TreeMaker transformer to authenticate to read general CMS data. 
He would like to get a new namespace on River to develop and test the CMS transformer, where a CMS grid cert can be placed and Alexx can have direct access. (This will happen next week when Lincoln is back from vacation.)
In the longer term he would like to transition away from using users' certs to set up service-x and get service certificates issued for it. 
Ben noted that CMS seems to index only nanoAOD in Rucio, which makes it difficult to work with miniAOD with service-x. Alexx said that the plan is for miniAOD to be in Rucio as well, although he does not trust this to happen quickly. 
Marc hopes that service-x can insulate users from needing to know the details of Rucio. 
Currently DAS is used more than Rucio. 
Marc suggested that an interface to DAS might be a short-term solution. 
Ben wanted to know whether DAS had a Python interface; it does not, as it was rewritten in Go due to poor performance in Python.
Alexx suggested that a good solution to support testing would be to get one dataset loaded into a test Rucio instance. 
Ben suggested running COFFEA on River, or Service-X on FNAL's cluster. 
Mark asked whether Service-X has enough functionality to do useful analysis. 
Peter said that 'just a column extractor' is already a valuable tool. 
Alexx reported that he has been able to observe the autoscaling in action for the CMS transformer, but asked why he had observed four transformer pods being started to process three files. Ben answered that this is because average CPU is used as the metric for driving the scaling.
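One way to see how a CPU-driven autoscaler can start more pods than there are files is sketched below. It assumes the scaling follows the standard Kubernetes Horizontal Pod Autoscaler rule, desired = ceil(current × currentMetric / targetMetric), with the default 10% tolerance. The minutes do not state which autoscaler or which target is configured, so the utilization numbers here are purely illustrative.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_cpu_percent: float,
    target_cpu_percent: float,
    tolerance: float = 0.1,
) -> int:
    """Replica count under the Kubernetes HPA scaling rule.

    The rule only looks at average utilization across pods, not at how
    many input files remain, so it can overshoot the number of files.
    """
    ratio = current_cpu_percent / target_cpu_percent
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: no change
    return math.ceil(current_replicas * ratio)


# Hypothetical numbers: three busy pods averaging 90% CPU against a 70% target.
print(desired_replicas(3, 90, 70))
```

With these assumed values, three pods working on three files are scaled to four, even though the fourth pod has nothing to process.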
Alexx mentioned that as an (admin) user, it is impossible to tell which pods are idle, and it would be nice to have more insight into this. 
Ben asked whether it will be feasible to replace Crab in the near term. Alexx thought that this would be much farther off. Alexx found that the work he had to do to package his transformer was probably beyond the reach of most users.
Kevin noted that Crab can be used to run any type of job on the grid, and while service-x might replace some of the major uses, it does not cover all of them. There is also CMS-Connect, which provides a simpler way for users to submit grid jobs. Therefore, service-x should be positioned as replacing one class of uses for Crab, not Crab itself. 
Alexx noted that CMS makes effective use of Singularity containers, but that this use is very different from what Service-X demands of Docker containers: the Singularity containers are distributed through CVMFS, and obtain much of their specific software that way as well. He noted that the transformer image will need to contain custom software for different groups of users.
Kevin explained that TreeMaker contains a limited set of 'producers' which know how to read and manipulate certain datatypes; any additional datatype will require the inclusion of other producers which are not in that image.
Mark thought that creating n-tuple files is something that probably works better with traditional grid jobs anyway, so while it is something Service-X can do, it should not be the sole focus. He is also concerned about the idea of having to move large quantities of data to MWT2. He suggested that the focus should be on enabling low-latency analysis, and on backing away from anything which involves high latency, for which the grid is more suitable.
Kevin thinks that converting xAOD is a problem mostly unique to ATLAS; CMS already produces nanoAOD, which leaves nothing for service-x to do. He is not enthusiastic about forcing users to rewrite their code for a different framework.
    • 12:00 12:20
      Action items 20m
      Speakers: Benjamin Galewsky (Univ. Illinois at Urbana Champaign (US)) , Marc Gabriel Weinberg (University of Chicago (US))
    • 12:20 12:40
      ServiceX for CMS 20m
      Speaker: Marc Gabriel Weinberg (University of Chicago (US))