WLCG CCRC'08 Post-Mortem Workshop
Friday, 13 June 2008
Notes taken by Pablo Saiz
____________________________________________________________
************************************************************
14:00 CMS post mortem
Daniele Bonacorsi
____________________________________________________________
************************************************************
____________________________________________________________
During the month of May, CMS was running two challenges simultaneously:
*) iCSA -> CMS specific
*) CCRC'08 phase 2 -> common with all the VOs
The T0 workflow contained three activities: the repacking tests, T0 processing and the CAF.
The repacking, done with CRuZET data, was successful. It was immediately followed by merging and reconstruction. Out of 3200 jobs, 28 failed.
There were many issues copying the files to CASTOR: the number of drives allowed to write was too small; there were several SRM problems; and some files were not migrated to tape, which CMS noticed.
The CAF resources were increased to the pledged 2.1 MSI2k and 1.6 PB. The CAF is in good shape.
Concerning the data transfers, CMS transferred ~3.6 PB of data in 32 days (including the debug and production instances).
For the T0-T1 transfers, the goal was 600 MB/s sustained over (at least) 3 days per week for (at least) 2 weeks (a worked check of this target is sketched further below). Five T1s (ASGC, CNAF, CCIN2P3, PIC, RAL) managed to do it, whereas the other two (FNAL, FZK) achieved the target for only one week. The problem with FNAL still has to be investigated. With FZK, the issue was that they focussed on production data, and there was not enough of it to fill up the transfers. At the beginning of June they repeated the exercise with debug data, and the target was achieved. There was also an issue with CCIN2P3: for a couple of days the transfers were not working.
Fabio Hernandez: The problem was not a CCIN2P3 problem. The CERN CRL was delivered expired.
For the T1-T1 transfers, each site had to replicate the data to at least 4 other T1s within 4 days. All the T1s managed to replicate the data to the other 6 sites.
For the T1-T2 transfers, each T1 had to transfer to the T2s in its region plus T2s in 4 out of the other 6+1 regions. There were two cycles: 16-23 May and 24-31 May. In the first cycle, CCIN2P3, PIC and RAL did not manage to do all the transfers; however, all of the T1s met the target in the second cycle. 90% of the commissioned links were tested (commissioned links =~ 2/3 of the total links).
The T2-T1 transfers ran simultaneously with the downlink testing and were very stable.
The T1 workflows included skimming and reprocessing at large scale, concurrently with the data transfers. On average, the reprocessing slots at each T1 were: 800 FZK, 400 PIC, 500 IN2P3, 300 RAL, 300 ASGC, 300 CNAF, 3k FNAL.
For the skimming, there were on average between 500 and 1000 slots. There was no skimming at FNAL.
The average job duration was 25 minutes. At FZK, it was much longer (12-24h), possibly due to slow data access.
John Gordon: Were the jobs faster with CASTOR because the data was prestaged?
Daniele: FZK had some issues with dCache, and that is why the jobs took longer. At the same time, all the jobs succeeded.
T2 workflows include MC production (although it won't be reported here) and analysis. The analysis was done in four phases: preparation, controlled, chaotic, and stop-watch (where the T2s do everything).
John Gordon: How do you make the jobs go to the T2s?
Matthias Kasemann: The analysis is not allowed to be done at the T1s.
Daniele: In general, you can run on anything that is in DBS. In the first part, the submission was done pointing to the site; for the second part, we were sending jobs for files that were only at T2s.
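As an illustration of the T0-T1 metric above: the check can be done mechanically from per-day average rates. The following is a minimal sketch with hypothetical numbers, not part of any CMS or PhEDEx tooling, and it assumes one reading of the metric, namely that a week counts if at least 3 of its days averaged 600 MB/s or more.

    # Minimal sketch (hypothetical data): was the CCRC'08 T0->T1 target of
    # 600 MB/s, sustained on >= 3 days per week for >= 2 weeks, met?
    TARGET_MBS = 600
    DAYS_PER_WEEK = 3
    WEEKS_NEEDED = 2

    # Hypothetical per-day average rates (MB/s) for one T1, grouped by week.
    weekly_rates = [
        [610, 580, 700, 650, 300, 620, 590],   # week 1
        [640, 615, 605, 500, 700, 450, 630],   # week 2
    ]

    def week_ok(rates):
        """A week counts if at least DAYS_PER_WEEK days averaged >= TARGET_MBS."""
        return sum(1 for r in rates if r >= TARGET_MBS) >= DAYS_PER_WEEK

    weeks_met = sum(1 for week in weekly_rates if week_ok(week))
    print("target met" if weeks_met >= WEEKS_NEEDED else "target missed",
          f"({weeks_met} week(s) above threshold)")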
To sum up, the transfers are mostly OK. It was the first time that the reprocessing and skimming ran together with the transfers, and the results were positive. The CCRC exercises are very useful; we learnt a lot about the components that have been tested. In particular, the multi-VO exercises are crucial.
Questions:
John Gordon: At RAL, we saw a lot of user work. Why?
Daniele: That was not associated with the CCRC. Users can still submit jobs using DBS.
John Gordon: We are not bothered by them, but we don't want them to access tape.
Matthias: As long as they don't compete with production, users are allowed to run.
Q: What was the "unknown" category of jobs?
Julia Andreeva: We have some limitations in the dashboard for pending jobs. We know the job was submitted, but until it starts running we don't know the category. We are working with L&B to improve this.
____________________________________________________________
************************************************************
14:00 ALICE post mortem
Latchezar Betev
____________________________________________________________
************************************************************
____________________________________________________________
The tasks that ALICE carries out are:
o) registration of data in CASTOR and on the Grid
o) replication T0-T1
o) conditions data gathering
o) reconstruction
o) quality control
o) MC production and user analysis at the CAF and T2s
During May there were a lot of detector activities, as well as upgrades at the sites. During that time, the registration and replication kept going. In total, 84 TB of data were collected. The raw files now have a size of 10 GB, which should help with the MSS requirements. Only the raw data was replicated.
The conditions data gathering has been in operation since mid-2007. During the last 3 months it was stress-tested. The fast reconstruction was also tested.
For the analysis, the critical data is replicated immediately to the CAF, and users can analyse the data there. In addition, user Grid analysis was ongoing.
To sum up, we are reasonably ready for LHC data.
Questions:
Harry: Were there a lot of problems installing the new versions?
Latchezar: In two weeks we deployed to more than 70 sites, and it was a new version of the operating system, the VO-box, AliEn... Given the amount of changes, this is quite an achievement.
____________________________________________________________
____________________________________________________________
************************************************************
14:00 Critical services
Presentations from the experiments
____________________________________________________________
************************************************************
____________________________________________________________
ALICE: Patricia Mendez and Pablo Saiz
****************
The list of critical services for ALICE, and the maximum downtime for each, is:
o) Site VO-boxes -> 2 hours
o) CASTOR + xrootd at the T0 -> 2 hours
o) MSS@T1 -> 12 hours
o) FTS T0-T1 -> 8 hours
o) gLite WMS or RB -> 12 hours
o) PROOF@CAF -> 12 hours
(A sketch of how these limits could be checked against monitoring data follows below.)
During the CCRC there was a good reaction from the sites, and fast recovery in case of problems. All the VO-boxes have migrated to SLC4 and gLite 3.1.
For the monitoring of these services there are several tools. MonALISA is used for all of them; some are also monitored by GridMap, the dashboard, GridView and SAM. It is good to have several sources, because then you can detect more problems, and you can also compare with the other VOs.
Questions:
Fabio Hernandez: The VO-boxes have a time to recover of 2 hours! I'm not sure that the sites can promise that.
Latchezar: If the VO-box is down for less than 2 hours, there is no degradation of the site. If the VO-box is down for more than that, then the site will stop submitting jobs and transferring data. To improve the reliability, sites can have several VO-boxes.
ATLAS: Birger Koblitz
*******************
The ATLAS critical services include everything that is needed for centralised file transfer, production and reprocessing. The full list can be found at https://twiki.cern.ch/twiki/bin/view/Atlas/AtlasCriticalServices
The services are monitored by SLS, and service maps and the dashboard are used to follow up on their performance.
ATLAS has introduced a 16/7 distributed computing shift covering two time zones, and is looking for a third time zone to cover the remaining 8 hours.
The power cut on Friday, May 30th was a useful accident to verify the alarms. CERN-IT and ADC Operations handled the event very well. Nevertheless, the procedures and checklists should be improved.
There was a big problem with kernel upgrades, which made the site services unstable.
Question: Since you have already Quattorised your components, you could test all these kernel upgrades before they come into production.
Birger: The problem is that we are running a single data export; we don't have the means to run two instances, and we need more people to handle it. We would like to get notice of kernel upgrades in advance, so that we can check ourselves that there are no problems.
CMS: Daniele Bonacorsi
********************
The list of critical services can be grouped into the following categories:
*) Maximum support (WLCG/Grid standards + expert call-out + 24/7 on-call):
   Oracle, CERN SRM, CASTOR, DBS, CASTOR pool, batch queues, Kerberos, Cessy-T0 network, CERN internal network, telephones, web "backends", Cessy-T0 transfer system
*) Key production services (WLCG/Grid standards + expert call-out):
   CERN FTS, PhEDEx, Frontier, LaunchPad, AFS, Tier-0, VOMS, MyProxy, BDII, CERN external network, ProdManager
*) Other services:
   apt, build machines, tag collector, testbed, cmsdoc, twiki, CMS tb, SAM, dashboard, PhEDEx monitoring, Lemon, webtools, email, SMTP, HyperNews, Savannah, CMSSW server, CVS, Linux installation, telecom, valgrind machines, performance benchmarking machines, Indico, LFC
CMS is working on creating mailing lists to replace personal mails. All communications regarding the status of the services should go through those mailing lists.
The power cut was a very useful exercise: the services recovered nicely. The recovery list was ready, the person in charge went through it, and all the services were up before lunch.
Philippe Charpentier: How many computer shifters do you have? (also to ATLAS)
Daniele: CMS doesn't have a stable shift exercise; it will use a list of volunteers. We should be running with 1 shifter.
Birger: For ATLAS, 2 people external to CERN, and we are trying to implement the third shift in Asia. We are trying to set up a 24/7 shift in the control room with 2 people, who also work on the central services.
Sophie: Do you need the central LFC at CERN?
Daniele: Nope, it is not used anymore.
Sophie: It is still being used by some users.
LHCb: Roberto Santinelli
***********************
The full list can be found at https://twiki.cern.ch/twiki/pub/LHCb/CCRC08/LHCb_critical_Services2.pdf
The services that were used during the May phase were the SE/SRM, DIRAC3, FTS and the LFC.
The services are monitored indirectly through performance degradation, and directly through the SAM and SLS web pages.
Future directions include looking into the logging system as the primary source of information. Moreover, the agents parsing the messages logged by the DIRAC components should be able to spot and identify problems.
Harry: How many shifters?
Philippe: We will be happy if we get 2 shifters per day.
John: You mentioned problems talking back to the bookkeeping. Do you have a SAM test for that?
Roberto: Nope, not yet.
____________________________________________________________
____________________________________________________________
************************************************************
14:00 Conclusions
Jamie Shiers
____________________________________________________________
************************************************************
____________________________________________________________
The CCRC is not over. The three metrics that have been used to monitor and report progress are: the scaling factors of the functional blocks, the critical services defined by the experiments, and the WLCG MoU targets.
The middleware versions and the databases met all the requirements.
Storage is not there yet. We should try to find solutions for what is really needed, and focus on what is required today.
At the T2s, MC production runs well; the analysis still has to be scaled up.
At the T1s, the data transfers are under control; the reprocessing for ATLAS still needs to be demonstrated.
The T0 work includes the critical services lists. We know how to run stable and reliable services: it takes some discipline, but it is doable.
For the next workshop, planned for November 13-14: would a one-day workshop be enough?
For CCRC'09 there will be some changes, and we have to make sure that the system keeps working.
From the services point of view, we now have to work on making them even more stable.
Finally, don't forget the BBQ on Wednesday, 25th of June. Thank you for your attendance.