Subject: Decisions from the Software and Computing workshop
Decisions from the Software and Computing workshop of Nov. 29 - Dec. 3 at CERN (http://indico.cern.ch/conferenceDisplay.py?confId=76896). The decisions were prepared during the week and finally discussed during the Friday morning session.
Distributed Conditions Database Consolidation (Dario Barberis)
Simulation and reconstruction as well as re-processing need only DB releases. Analysis accesses conditions through Squids and Frontier servers at the T1s. Squid caches reduce access time for the jobs and load on the databases. Currently there are 6 T1s with a Frontier server, plus 1 at CERN. Discussions with the T1s at ASGC, SARA and CNAF are ongoing to provide Squids. Existing Frontier servers will remain and be consolidated, and the sites need to provide manpower and hardware support for a robust service. The number of Frontier services at T1s could possibly be reduced further and the freed capacity used for additional TAG DBs.
File Catalogues LFCs (Simone Campana)
We have one LFC per cloud in each of the T1s plus one in the important T2s in the US. We will merge all these LFCs into one at CERN, with a live backup in one or more T1s which can also be used for other purposes such as consistency checks, dumps and monitoring. Eventually this master LFC will be combined with the DDM central catalogue, which now provides very similar functionality. A staged migration will allow scalability to be measured. We need to agree on this DB service with CERN-IT and work on the migration tools. The migration will take a few days per T1 LFC and should start with some small clouds as early as possible next year.
CVMFS (Rod Walker)
Current software installation at sites suffers from the non-scalability of shared file systems such as NFS and AFS. CVMFS will make all releases available instantly, as well as all DBReleases. Conditions data files could also be made available, and HOTDISK would no longer be needed. A mirror could be maintained to avoid CERN being a single point of failure. Releases must be made relocatable with the $ENV variable. CVMFS is currently available at some sites and on some lxplus nodes at CERN. Sites should be encouraged to install it to gain further experience; this is more important for problematic sites. Instructions and a migration strategy are to be provided before the end of January.
ADCLab (Graeme Stewart)
A first plan was presented and discussed and was generally well received. It needs to be further integrated with Software and Upgrade Computing. It will be discussed next week.
Clouds and Colours (Graeme Stewart)
The clouds will have the name of the T1 country: CA, DE, ES, FR, IT, NL, TW, UK, US with 2 exceptions: CERN for CERN and ND for NDGF. A color code for those countries as well as for datatypes (RAW, HITS, ESD, AOD, DESD, DPD and Other) and data priority (ToBeDeleted, Default, Secondary, Primary and Custodial) was decided.
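The naming scheme above can be written down as a small lookup, for instance to validate labels in monitoring pages. This is only a sketch: the function name `is_valid_label` is hypothetical, and the actual colour values are not listed here, so none are included.

```python
# Cloud names as decided: T1 country codes, plus the two exceptions CERN and ND (for NDGF).
CLOUDS = ("CA", "DE", "ES", "FR", "IT", "NL", "TW", "UK", "US", "CERN", "ND")

# Categories that receive a colour code (the colours themselves are not given here).
DATATYPES = ("RAW", "HITS", "ESD", "AOD", "DESD", "DPD", "Other")
PRIORITIES = ("ToBeDeleted", "Default", "Secondary", "Primary", "Custodial")


def is_valid_label(cloud, datatype, priority):
    """Hypothetical helper: check that all three labels belong to the agreed sets."""
    return cloud in CLOUDS and datatype in DATATYPES and priority in PRIORITIES
```

For example, `is_valid_label("ND", "AOD", "Primary")` is accepted, while a label using "NDGF" as the cloud name would be rejected.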
Software (David Rousseau)
The backward compatibility breaking scheme was accepted and a cleanup of obsolete persistent classes will begin. ESD event size will be reduced, but not by the factor that is needed unless drastic actions are taken. Another handle being discussed is a heavily pre-scaled DESD to allow for special studies that need more than full ESD information but not for all events. The data reco VMEM limit is increased to 2.5 GB. Reconstructing a JETTAUETMISS event now takes 20 sec/event for 3.7 collisions per bunch crossing, which will be most representative for the 2011 run. Data and MC will be reprocessed during the spring for the summer conferences. Release 17 will be used and the deadline for changes is February 14. Releases scheduled to be deleted were deleted without complaints.
Monitoring (Alex Read)
A choice was made to use the IT-supported ATLAS Global Job Monitoring (GJM) Dashboard for job monitoring and to integrate it as tightly as possible with the second generation Panda Monitor. Information to the GJM database comes from the Panda DB and from jobs directly via ActiveMQ messages. It uses a schema copied from CMS monitoring as well as aggregates for history, but it will be discussed whether this could all be deduced from the Panda DB directly, using Dashboard infrastructure but without an intermediate schema (report early February 2011). The very unstable Prodsys Dashboard will be discontinued when its functionality is implemented elsewhere (before mid-March 2011). Expert AdCoS shifters and UK site administrators will be asked to start using SSB for daily operations and to provide feedback for an evaluation of it in Napoli (early February 2011). Development of the promising DDM Dashboard 2 will continue. Resources to further develop the interface of EGG to ADC will not be allocated.
Extending PD2P (Kaushik De)
PD2P was introduced 5 months ago and is now used in all clouds, but not for all datatypes. It dramatically changed data distribution for analysis without affecting efficiency. The exponential rise of disk occupancy changed to a light increase. It was decided to extend PD2P to all datatypes except RAW and stop all data placement to T2s, effectively transforming all T2 disk space into a cache. This makes it unnecessary to distinguish between DATADISK and MCDISK and it was therefore decided to merge them.
As the T1 disks are filling up (small clouds more than big clouds) it was also decided to extend PD2P to T1s. All custodial data is kept on tape. Fewer primary datasets (1 for ESD and 2 for further derived datatypes such as AOD and DESD) are pushed to T1s. All secondary replicas are made by PD2P based on analysis usage. Manually requested datasets are still allowed through DaTri. Panda is used to make secondary replicas, initially to T2s and, if there are too many waiting jobs, also to other T1s (based on MoU share) and then to T2s. This could be repeated for more T1s if there are still too many waiting jobs. It was decided to start testing this in January.
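The escalation order described above (T2s first, then further T1s while too many jobs are waiting) could be sketched roughly as follows. This is not the Panda implementation: the function name, the `threshold` parameter and the example MoU shares are all hypothetical. "Based on MoU share" is simplified here to picking the unused T1 with the largest share, and the follow-up step of replicating onward to T2s is left out.

```python
def next_replica_target(waiting_jobs, threshold, has_t2_replica, used_t1s, mou_share):
    """Return ("T2", None), ("T1", name) or None (nothing more to do).

    All arguments are illustrative assumptions:
    waiting_jobs   - number of jobs currently waiting for the dataset
    threshold      - what counts as "too many" waiting jobs
    has_t2_replica - whether a secondary replica already exists at a T2
    used_t1s       - T1s that already hold a secondary replica
    mou_share      - mapping of T1 name to its MoU share
    """
    # The first secondary replica always goes to a T2.
    if not has_t2_replica:
        return ("T2", None)
    # Escalate to another T1 only while too many jobs are waiting.
    if waiting_jobs <= threshold:
        return None
    candidates = {t1: share for t1, share in mou_share.items() if t1 not in used_t1s}
    if not candidates:
        return None
    # Simplification: take the unused T1 with the largest MoU share.
    return ("T1", max(candidates, key=candidates.get))


# Made-up shares, for illustration only.
example_shares = {"DE": 0.12, "FR": 0.15, "UK": 0.10}
print(next_replica_target(80, 50, True, {"FR"}, example_shares))
```

Calling the function again after each new replica, with the updated waiting-job count, reproduces the "repeat for more T1s" behaviour.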
Data Management (Ikuo Ueda)
We will keep a copy of ESD on tape in T1s and start immediately with the data from this fall's re-processing. Tapes in T1s have been very much underused. As PD2P will also be used for T1s, DATADISK and MCDISK will be merged everywhere. GROUPDISK at T2s has now become an anomaly and will be moved to T1s. The Central Catalog allows group quotas to be controlled centrally, so a separate space-token is no longer needed. This way T1s will host the persistent store of all group data with global quota, and all transient T2 copies will be determined by PD2P. This change needs to be discussed with T2s that co-locate non-pledged T3 resources or Analysis Centres.
There are minutes attached to this event.