1) EVNT file merging

See: https://its.cern.ch/jira/browse/PRODSYS-576

There are many generator samples where, due to low filter efficiency, the number of events produced per task in a 24-hour grid job is small (we have a few samples with as few as 10 or 20 events, and many in the 100 to 500 event range). Storage systems handle small files poorly, so the ProdSys group has proposed adding an EVNT merge phase. In principle this is a very good idea, and the merge transform already exists. The issues we need to deal with are:

-How do we propagate the metadata from the unmerged dataset to the merged dataset in AMI? End users will only see the merged datasets and will look for the relevant information there.
-Do we add a second e-tag for the merge step? Changing the dataset names will probably break many Python scripts people are using to parse dataset names, so we should not make this change lightly. If we decide to do it, we need to make sure we warn people.

2) EVNT to EVNT processing

See: https://its.cern.ch/jira/browse/PRODSYS-722

There are many datasets where we separate a given physics process into distinct subsets (top nonhadronic and allhadronic; Sherpa V+jets where the jets are light, charm, or bottom; etc.). ProdSys does not at the moment support multiple output datasets from the same task, so for now we rerun the evgen separately for each sample. This is quite wasteful, especially for slow processes such as Sherpa V+jets. We've been working with the ProdSys team on adding an EVGEN to EVGEN option. The option to do this already exists in Generate_tf, and tests are underway to make sure things run properly in production. The issues are similar to those above:

-How do we handle the metadata in AMI?
-What do we do with the dataset name? We read DS1 and filter it to make DS2; do we call the new dataset MC16.DS1.DS2 or just MC16.DS2? What do we do about e-tags? Is it MC16.DS1.etag1.DS2.etag2?
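As a rough illustration of what the metadata propagation in item 1 would involve, the sketch below merges per-file records into one record for the merged dataset: additive quantities are summed, while per-sample constants are checked for consistency and carried over. The field names (`n_events`, `sum_weights`, `cross_section`, `filter_eff`) are hypothetical and do not reflect the actual AMI schema; this is only a sketch of the bookkeeping, not an AMI interface.

```python
from dataclasses import dataclass

# Hypothetical, simplified view of the per-file metadata of an EVNT file.
# Field names are illustrative, NOT the actual AMI schema.
@dataclass
class EvntMeta:
    n_events: int          # events in this file
    sum_weights: float     # sum of generator event weights
    cross_section: float   # per-sample constant, must agree across files
    filter_eff: float      # per-sample constant, must agree across files

def merge_metadata(parts):
    """Combine the metadata of unmerged EVNT files into one record for
    the merged dataset: sum additive fields, carry per-sample constants
    through after checking they agree."""
    first = parts[0]
    for p in parts[1:]:
        if (p.cross_section, p.filter_eff) != (first.cross_section, first.filter_eff):
            raise ValueError("inconsistent per-sample metadata across input files")
    return EvntMeta(
        n_events=sum(p.n_events for p in parts),
        sum_weights=sum(p.sum_weights for p in parts),
        cross_section=first.cross_section,
        filter_eff=first.filter_eff,
    )
```

For example, merging two files of 200 and 150 events from the same sample yields a record with 350 events and the unchanged cross-section and filter efficiency.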
Here again, we should try to avoid breaking monitoring tools and users' scripts if possible.

3) Simulation running on the HPCs: merging HITS files

See: https://its.cern.ch/jira/browse/ATLMCPROD-4458

It's great that we are starting to see significant simulation samples from the HPCs. In general, the HPC configurations run fewer events per job than the standard PanDA sites. We need to move to a uniform output size for each dataset and have that size be sensible. Some of the current simulations done on HPCs are extensions of existing samples, so we have cases where different _tid's in the same dataset have different numbers of events in the HITS file. The proposal is to merge HITS files to a common length. Again, we need to resolve the issue of filenames and ensure minimum disruption for the monitoring and user code.

4) Multicore Evgen

See: https://its.cern.ch/jira/browse/ATLASJT-301

Until now, all evgen has been single core. It would be good to have the option of running in multicore mode (especially for very slow generators). There have been some technical issues with Generate_tf, but they are almost resolved. The questions that remain are:

-In the JobOptions, we specify "evgenConfig.minevents = N". Here N has been interpreted as the number of output events after filtering that come from the job, and it is set so that the time for production of a single file (in single-core mode) is less than 24 hours. How do we want to interpret N for a multicore job? Should it be the number of events per core (which would be the natural interpretation)? What are the implications of that for ProdSys? (If we have 8 cores, the merged output won't match the magic numbers we use to ensure that each evgen file maps easily into simulation jobs.)
-How do we propagate the metadata to AMI? (We need to merge the statistics from all the cores, which could probably be done with a simple Python script.)
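The "simple Python script" mentioned in item 4 might look like the sketch below: sum the statistics that each worker core reports and compare the merged total against the events-per-core reading of minevents. The dictionary keys and the function itself are illustrative assumptions, not an actual Generate_tf interface.

```python
# Hypothetical sketch: merge the per-core event statistics of a multicore
# evgen job. The stats layout ({"nEvents": ..., "sumOfWeights": ...}) and
# the function name are assumptions for illustration only.

def merge_core_stats(per_core, minevents, ncores):
    """Sum per-core event counts and weight sums, and check the merged
    output under the interpretation 'minevents = N events per core'."""
    total = {
        "nEvents": sum(c["nEvents"] for c in per_core),
        "sumOfWeights": sum(c["sumOfWeights"] for c in per_core),
    }
    # Under the events-per-core interpretation the merged file should hold
    # minevents * ncores events -- this product is what has to stay
    # compatible with the ProdSys "magic numbers" discussed above.
    total["expected"] = minevents * ncores
    total["complete"] = total["nEvents"] >= total["expected"]
    return total
```

For instance, an 8-core job with minevents = 1000 per core would be expected to deliver a merged file of 8000 events, which is the number that then has to map cleanly onto simulation jobs.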