Julia Andreeva: Monitoring
===========================
Dashboard Job Monitoring application, focused on analysis activities.
Lessons learned: the diagnostics coming from the applications are not sufficient. Is it a site or an application problem?
Job monitoring: 200K jobs per day (analysis jobs). Rather high failure rate: 40%.
LHCb: monitoring with the DIRAC monitor. Crashed production applications raise alarms.
Quality plot of the staging agent: problem at GridKa, solved very quickly.
Quality plots coming from CMS (Brian's library): one experiment integrating the tools of other experiments.
Lack of information in GridView for T1-T1 transfers, because FTM is not yet installed everywhere.
Monitoring of critical services published in SLS.
Site monitoring: multiple sources of information; difficult to understand where to check.
Correlation of service functionality among experiments: we need something (one cannot look into the LHCb DIRAC monitoring with a CMS certificate).
Check the success of the experiments' workflows. A proposal is under discussion: the criteria defining the view (data transfer, job submission, etc.) should be selectable.

Comments:
==========
For Tier-2s the situation is difficult. Tools are needed, even home-made ones. Some common recommendations would be welcome.
Answer: sites should use fabric monitoring and publish back into the central fabric monitoring framework. Recommended monitoring systems: Nagios and Lemon.
Use the same identity for a site: site names are officially defined (in the accounting and reliability reports).
Power cut at CERN: what monitoring tools can be used to understand what is going on if something goes wrong at BNL (connectivity) because of CERN? Only degraded performance was noted.
How is networking monitored? It is missing. The problem is more at the monitoring level than at the debugging level.
Need a kind of experiment console at the site, with a high-level view showing whether data are received, sent, reprocessed, etc. for an experiment. This is in line with the current monitoring programme of work.
Does the experiment need to know what is happening at a particular site? Sites can provide information to the experiment monitoring.
Common timestamp format in UTC.
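The last two recommendations above (Nagios/Lemon fabric checks whose results are published back centrally, and a common UTC timestamp format) can be illustrated with a minimal sketch of a Nagios-compatible check plugin in Python. The checked quantity, the thresholds and the pending_requests() helper are hypothetical, not part of the minutes.

    #!/usr/bin/env python
    # Minimal sketch of a Nagios-style check plugin (hypothetical service;
    # thresholds and the metric are assumptions, not from the minutes).
    # Nagios convention: print one status line, exit 0=OK, 1=WARNING,
    # 2=CRITICAL, 3=UNKNOWN.
    import sys
    from datetime import datetime, timezone

    WARN_PENDING = 100   # assumed warning threshold for pending requests
    CRIT_PENDING = 500   # assumed critical threshold

    def utc_stamp():
        # Common timestamp format in UTC (ISO 8601), as recommended above.
        return datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

    def pending_requests():
        # Placeholder: a real check would query the local service
        # (e.g. count queued transfers); here we return a dummy value.
        return 42

    def main():
        pending = pending_requests()
        if pending >= CRIT_PENDING:
            state, label = 2, "CRITICAL"
        elif pending >= WARN_PENDING:
            state, label = 1, "WARNING"
        else:
            state, label = 0, "OK"
        # The performance data after the '|' separator is what the site's
        # fabric monitoring can harvest and publish back centrally.
        print("%s - %d pending requests at %s | pending=%d"
              % (label, pending, utc_stamp(), pending))
        sys.exit(state)

    if __name__ == "__main__":
        main()

The status line and exit code are what Nagios records; timestamping everything in UTC avoids ambiguity when correlating site and experiment monitoring across time zones.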
Simone Campana: Experiment support during CCRC08
=================================================
It is not easy for a site to understand what is going on; only the VO managers have certain information. It should be streamlined; interested people can subscribe.
Notifications for downtimes.

Diana Bosio: ROC
=================
EGEE broadcast is too complicated for a CERN operator; the CERN ROC is not going to send broadcasts.
No procedures in place, for instance in case of a downtime. OK during office hours.
GGUS ticket is the preferred way for ROCs. Add a string in the short description to trigger the sites to treat the ticket with due priority.

Torsten Antoni: GGUS
====================
A new field with the ticket type.
Discussion on who is authorized to raise an alarm. Personal e-mail addresses cannot be used to contact people.
Categories: switch T0 to CERN; T0-T1-T2 is what is used in the MoU.
A phone system is needed over the alarm e-mail system. Can the procedure be made lighter, with everything handled by phone only? It works for the CIC.
Mail or ticket is the same: it is up to the sites to choose the interface. Attempt to be systematic and have just one interface.

PIC report
==========
gsidcap problem with dCache, non-reproducible.
Monitoring tools for tape recalls and usage.
Condition files: it would be good if they were put in one directory.
Collapse of the LAN during CCRC08.

GridKa report
=============
LFC crashing.
VO Box credential problems.
Limit the number of skim jobs.
Activities from all VOs did not happen at the same time. PIC was fully loaded; however, there was no competition for resources.
Categorize jobs per I/O; use different roles in order to differentiate the workload.

RAL
====
Combined reconstruction and reprocessing should be done. It should be scheduled before July 14 (start of the French holidays).