To dial in to the conference:
a. Dial +41227676000
b. Enter access code 0140768
NB: Reports were not received in advance of the meeting from:
------------------------------------------------- AsiaPacific -------------------------------------------------
-> Major Operational Issues Encountered During the Reporting Period
== ROC Report ==
<Site issues>
Open GGUS ticket status:
TW-FTT:
  #29850 - TW-FTT: enable VOView in the BDII (f-ce01.grid.sinica.edu.tw).
JP-KEK-CRC-01:
  #29445 - MyProxy failure on dg15.cc.kek.jp -> more information needed from the site admin.
PAKGRID-LCG2:
  #27112 - SE failure on SE.pakgrid.org.pk
  #26941 - CE failure on CE.pakgrid.org.pk -> Site is in scheduled downtime (SD) till 2007-12-03.
5 sites NOT publishing accounting data:
  Australia-ATLAS: 19 days
  HK-HKU-CC-01: 24 days
  INDIACMS-TIFR: 70 days (SD till 2007-12-07)
  JP-KEK-CRC-02: 55 days
  KR-KISTI-GCRT-01: 33 days
<Other issues>
TWAREN network maintenance on Dec. 1st:
  Start time: 2007-12-01 09:01
  End time: 2007-12-01 14:01
  Sites in APROC affected: NCUCC, NCUHEP, TW-NIU-EECS-01
== T1 Report ==
Network
* Established a new 5 Gbps link to CERN, in production since Nov. 29th.
* Testing Tokyo ICEPP network performance over the new 1 Gbps connection.
* Flapping with multiple outages on the backup CHI-AMS link; a report has been requested from the network provider.
T1
* Built up BDII load balancing and failover:
  * lcg00126 renamed to bdii01; the second BDII is named bdii02.
  * based on keepalived (integrated VRRP and IPVS).
  * one-level HA, direct-routing load balancing.
  * a simple script checks the BDII service instead of a plain TCP port check (a sketch of such a check follows the ROC reports below).
Castor
* Added 40 TB to the CMS CSA tape pool.
* Investigating Castor tape migration performance with rfcp testing:
  * one bottleneck found is the uplink of the network switch integrated in the disk-server blade chassis;
  * additional links will be added to increase network bandwidth.
------------------------------------------------- CentralEurope -------------------------------------------------
-> Major Operational Issues Encountered During the Reporting Period
Accounting for a site with an SGE batch system:
https://gus.fzk.de/ws/ticket_info.php?ticket=29426
https://gus.fzk.de/ws/ticket_info.php?ticket=29550
A site installed the gLite 3.1 lcg-CE on a host with a Sun Grid Engine batch system but ran into trouble while trying to publish APEL accounting data. While trying to solve the issue, the site was told it should not use a batch system that is uncertified (in terms of ETICS) at all.
-> Points to Raise at the Operations Meeting
1. SAM APEL test: when is it scheduled to become a critical test?
-> Availability report
The CYFRONET-LCG2 Tier-2 site remarks that, while analyzing availability reports, it is hard to determine the reason for decreased availability, because the tools that affect (FCR) and compute (GridView) availability are based on SAM results, which are available only for the last 7 days. We are aware that keeping a longer history is a performance problem, but perhaps it would be possible to provide an interface showing a short period of SAM results further in the past?
------------------------------------------------- CERN -------------------------------------------------
SFU-LCG: We have 400 queued atlasprd jobs on a 10-CPU cluster. Some SFT jobs fail because they cannot start for a long time.
CERN-PROD: Scheduled intervention on the LSF subsystem. It was announced, and a downtime was scheduled in GOCDB.
CERN-PROD: Soon after the latest GGUS release we received a number of update e-mails from GGUS concerning the verification by users of (sometimes) very old tickets. As the corresponding tickets were already frozen in our internal TT system, this caused a lot of new tickets to be opened. The issue was not systematic, in the sense that it did not concern tickets across the whole history, but it was nevertheless significant. We are asking the GGUS team whether they are aware of possible causes. We consider a post-mortem analysis worthwhile in order to correctly record the issue and address it for future releases.
CERN-PROD: Submission storm due to a WMS bug, affecting CMS. It started on Tuesday evening, went on until Thursday evening, and overloaded both the batch system and the CEs hosting the jobs. Because of this, CERN hosted more than 30k grid jobs for quite some time, and we passed the limit on the maximum number of jobs allowed in the batch system. The limit was increased from 50k to 75k to allow new submissions.
------------------------------------------------- France -------------------------------------------------
------------------------------------------------- GermanySwitzerland -------------------------------------------------
-> Major Operational Issues Encountered During the Reporting Period
Report for Tier-1 GridKa (FZK) [author: Jos van Wezel]:
---T1 site report went missing in this ROC pre-report---
Reconstructed here: short SE service interruption for an emergency update of the dCache SRM on 29/11. Severity: low.
**********
Report for ROC DECH [author: Clemens Koerdt]
o 15 German/Swiss sites in production running with gLite 3
o Specific news by site:
  * (none)
o WN middleware versions:
  o 5 sites on gLite 3.1
  o all other sites on gLite 3.0
o WN OS overview:
  o SL 4 (7 sites)
  o SL 3 (5 sites)
  o Debian (1 site)
  o CentOS (1 site)
  o SUSE 9 (1 site)
-> Points to Raise at the Operations Meeting
Issues compiled by ROC DECH [author: Clemens Koerdt]:
------------------
1.) Some sites are unsure about the correct procedure for introducing new service nodes into the production environment. Now that GOCDB no longer allows sites to switch off monitoring, should the sites initially put the nodes into 'maintenance'? Once they are in maintenance, can monitoring be switched off? And what is the procedure if a node needs to be decommissioned? Set it into maintenance first, then delete it from GOCDB, knowing that SAM continues to test it for another three days?
2.) Ticket https://gus.fzk.de/ws/ticket_info.php?ticket=28099 has remained in status 'assigned' for two weeks now.
3.) At least one site report went missing in this week's ROC pre-report.
-> Availability report
---T1 site availability report went missing in this ROC pre-report---
Reconstructed here: short SE service interruption for an emergency update of the dCache SRM on 29/11. Severity: low.
------------------------------------------------- Italy -------------------------------------------------
------------------------------------------------- NorthernEurope -------------------------------------------------
------------------------------------------------- Russia -------------------------------------------------
1. It seems that some users try to submit jobs to sites bypassing the RB/WMS system, using CE job submission APIs or Globus tools directly. What should we do about this (i.e. not care, encourage it, or prohibit it in some way)?
2. One of the Russian ALICE managers asked us to "install a PBS client on the VOBox". We wonder whether this should be allowed at all. What is the common practice on other VOBoxes? We would certainly not like to allow any grid users to submit jobs from the VOBox directly to the CE, bypassing the grid layer.
------------------------------------------------- SouthEasternEurope -------------------------------------------------
------------------------------------------------- SouthWesternEurope -------------------------------------------------
------------------------------------------------- UKI -------------------------------------------------
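A footnote to the ASGC T1 item on BDII failover above: keepalived can drive its failover decisions from an arbitrary check script rather than a bare TCP port probe. The sketch below illustrates the idea in Python, shelling out to the standard OpenLDAP ldapsearch client; the URI, base DN and timeouts are illustrative assumptions, not the script actually deployed at ASGC.

#!/usr/bin/env python3
# Minimal BDII health check, usable as a keepalived MISC_CHECK script:
# exit 0 = service healthy, non-zero = take this real server out of the pool.
# Hypothetical sketch: URI, base DN and timeouts are illustrative, not the
# actual ASGC deployment.
import subprocess
import sys

BDII_URI = "ldap://localhost:2170"  # 2170 is the standard gLite BDII port
BASE_DN = "o=grid"                  # standard BDII suffix
TIMEOUT = 15                        # seconds


def bdii_alive():
    """Return True if the BDII answers a real LDAP query.

    Fetching only the base entry is cheap, but it exercises the whole LDAP
    stack instead of merely confirming that the TCP port accepts connections.
    """
    cmd = ["ldapsearch", "-x",            # anonymous simple bind
           "-H", BDII_URI,
           "-b", BASE_DN, "-s", "base",   # base-scope search: a single entry
           "-l", str(TIMEOUT),            # server-side time limit
           "objectClass=*"]
    try:
        result = subprocess.run(cmd, capture_output=True, timeout=TIMEOUT + 5)
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0


if __name__ == "__main__":
    sys.exit(0 if bdii_alive() else 1)

In keepalived such a script would be wired in through a MISC_CHECK block (misc_path pointing at the script) on each real server, so an LDAP-level failure pulls the node out of the pool even while its TCP port still accepts connections.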
Availability report : BNL-LCG2
-> Remark[s] on 2007-11-25 (Saturday, Nov 24)
Problem: The Panda monitor on gridui03 was not available for 16 hours.
Cause: High data movement and NFS problems caused the monitor to hang.
Solution: We will move the Panda monitor machines to the .54 subnet.
-> Remark[s] on 2007-11-27
Problem: A user output file failed to be written to dCache.
Cause: dc008 ran out of inodes on the file system.
Impact: The read pool was disabled, complaining of no space; no data could be written.
Solution: Create a separate 4 GB partition, mounted as /controldata, with a default inode size of 1k (see the inode-check sketch after these availability reports).
-> Remark[s] on 2007-11-28
Problem: Prestage requests on dc027 were stuck.
Cause: An NFS client problem on dc027; /HPSSBAT/atlasdata/ could not be listed.
Impact: Prestage requests could not be sent to HPSS.
Solution: Reboot dc027.
-> Remark[s] on 2007-11-29 (Wednesday, Nov 28)
Problem: The Panda monitor machine gridui01 crashed for 2 hours.
Cause: High memory usage and high load brought the machine down.
Solution: Machine rebooted. The high memory usage must be addressed by the developers.
-> Remark[s] on 2007-11-30
Problem: Machine dbarch5, and the database on it, is not available.
Cause: The machine was taken down to move it to another subnet.
Solution: This is a scheduled downtime; the machine will come back when the work is completed.
--------------------------------------------------------------------------------------------------
Availability report : CERN-PROD
-> Remark[s] on 2007-11-30
See the CERN ROC report above: a scheduled intervention on the LSF subsystem (announced, with a downtime scheduled in GOCDB), and a submission storm due to a WMS bug affecting CMS that overloaded the batch system and the CEs from Tuesday evening until Thursday evening.
--------------------------------------------------------------------------------------------------
Availability report : TRIUMF-LCG2
-> Remark[s] on 2007-11-27
SRM trouble.
-> Remark[s] on 2007-11-30
SAM tests fail everywhere(?)
--------------------------------------------------------------------------------------------------
Availability report : SARA-MATRIX
-> Remark[s] on 2007-11-25
Problem: GIIS: old entries found in the site BDII.
Solution: One-time error; went away by itself.
-> Remark[s] on 2007-11-29
Maintenance due to a necessary immediate upgrade of dCache. The red is due to SAM problems.
-> Remark[s] on 2007-11-30
Problem 1: GIIS: old entries found.
Solution 1: One-time error; went away by itself.
Problem 2: import_cred.c:160: gss_import_cred: Unable to read credential for import: Couldn't open the file: /opt/edg/var/spool/edg-wl-renewd/48548840f49ff0d9359531e927e61fd6.177
Solution 2: One-time error; went away by itself.
Problems 3 and 5: The lcg-rm test timed out after 600 seconds.
Solutions 3 and 5: Went away by themselves.
Problem 4: srmAdvisoryDelete failed. The error message was: lcg_del: Communication error on send
Solution 4: Went away by itself.
--------------------------------------------------------------------------------------------------
Availability report : pic
-> Remark[s] on 2007-11-27
Date: 26/11/2007 from 12:40 UTC until 15:40 UTC
Problem: A failure in the internal pro-active monitoring system (Ingrid) caused site-bdii.pic.es to fail for some hours.
Severity: Medium. lcg-utils commands failed, since the SEs were not in the information system.
Solution: Restarting the site BDII.
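Regarding the BNL dc008 incident above: inode exhaustion typically surfaces as "no space" errors while df still shows free blocks, so a periodic inode check catches it before a pool is disabled. A minimal sketch in Python follows; the mount points and the 90% threshold are illustrative assumptions, not BNL's actual monitoring. (The fix itself, "a default inode size of 1k", presumably means a bytes-per-inode ratio of 1 KB at file-system creation time, e.g. mke2fs -i 1024.)

#!/usr/bin/env python3
# Periodic inode-usage check, sketched after the dc008 incident above.
# Inode exhaustion shows up as "no space" errors while df still reports
# free blocks. Mount points and the 90% threshold are illustrative
# assumptions, not BNL's actual monitoring.
import os
import sys

MOUNT_POINTS = ["/controldata", "/data"]  # hypothetical paths to watch
THRESHOLD = 0.90                          # warn above 90% of inodes used


def inode_usage(path):
    """Fraction of inodes in use on the file system holding `path`."""
    st = os.statvfs(path)
    if st.f_files == 0:  # some pseudo file systems report zero inodes
        return 0.0
    return 1.0 - float(st.f_ffree) / st.f_files


def main():
    warned = False
    for mp in MOUNT_POINTS:
        usage = inode_usage(mp)
        print("%-20s inodes used: %5.1f%%" % (mp, 100.0 * usage))
        if usage >= THRESHOLD:
            print("  WARNING: %s is close to inode exhaustion" % mp)
            warned = True
    return 1 if warned else 0


if __name__ == "__main__":
    sys.exit(main())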
Time at WLCG T0 and T1 sites.
Please read the report linked to the agenda.
https://twiki.cern.ch/twiki/bin/view/LCG/TransferOperationsWeeklyReports
There does not appear to be a report this week?
The procedures foresee a broadcast message being sent to the affected people (for LHCb this is the lhcb-production mailing list). We did not receive any message, and it would be good to understand why. Since the procedure is very well defined (which minimizes the possibility of errors on the sysadmin side), I tend to believe that the broadcast tool did not work properly this time, causing some perturbation in LHCb's daily activity. Can the relevant people (those maintaining these tools) look into that?