WLCG-OSG-EGEE Operations meeting

Start: 2006-06-26T14:00:00+02:00
End: 2006-06-26T17:30:00+02:00
Location: VRVS (Motorcycle room)

Monday 26 Jun 2006, 14:00 → 17:30 Europe/Zurich

28-R-15 (VRVS (Motorcycle room))

Maite Barroso

Description

VRVS "Motorcycle" room will be available 15:30 until 18:00 CET

- 1. Feedback on last meeting's minutes
- 2. Grid-Operator-on-Duty handover
  - From SouthEasternEurope (backup: CERN) to France (backup: CentralEurope)

Issues to discuss from reports

Reports were not received from [Biomed, CMS]

R-GMA critical bug and fix (ROC coordination)
Services at version 5.0.22/3. Immediate action: stop tomcat. Longer term: make sure the rpm is upgraded, then restart tomcat.
cs-grid3.bgu.ac.il, dgbdii0.icepp.jp, grid-ui.physik.uni-wuppertal.de, mon1.egee.fr.cgg.com, tbat02.nipne.ro, lxn1188.cern.ch
Connection timed out, cannot determine version. Immediate action: stop tomcat. Longer term: make sure the rpm is upgraded, then restart tomcat.
grid01.uibk.ac.at, egeemon.ifca.org.es, grid005.ct.infn.it, rgmamon.lnl.infn.it, gridit002.pd.infn.it, CE.pakgrid.org.pk, testbed004.grid.ici.ro, lemon.grid.kiae.ru, lcg13.sinp.msu.ru, niugmon.grid.niu.edu.tw
Connection refused, cannot determine version. Tomcat is not running, so no immediate action is required. Longer term: make sure the rpm is upgraded, then restart tomcat.
mon01.pic.es, mon-lcg.projects.cscs.ch, cclcgmoli01.in2p3.fr, epgmo1.ph.bham.ac.uk
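
The three cases above amount to a small lookup from probe outcome to recommended action. The following is only an illustrative sketch: the outcome labels and the function names `recommended_actions` and `action_plan` are hypothetical, and the code does not probe any host itself.

```python
# Hypothetical labels for the three probe outcomes listed in the report.
VULNERABLE = "version 5.0.22/3"
TIMED_OUT = "connection timed out"
REFUSED = "connection refused"

# The longer-term action is the same in all three cases.
LONG_TERM = "make sure the rpm is upgraded, then restart tomcat"

# Immediate action per outcome, as stated in the report.
IMMEDIATE = {
    VULNERABLE: "stop tomcat",
    TIMED_OUT: "stop tomcat",
    REFUSED: "none (tomcat is not running)",
}


def recommended_actions(outcome):
    """Return (immediate, longer_term) for one probe outcome."""
    if outcome not in IMMEDIATE:
        raise ValueError("unknown probe outcome: %r" % (outcome,))
    return IMMEDIATE[outcome], LONG_TERM


def action_plan(probe_results):
    """Map each host to its recommended actions.

    probe_results: dict of host name -> outcome label.
    """
    return {
        host: recommended_actions(outcome)
        for host, outcome in sorted(probe_results.items())
    }


if __name__ == "__main__":
    # One host from each list in the report.
    plan = action_plan({
        "lxn1188.cern.ch": VULNERABLE,
        "grid01.uibk.ac.at": TIMED_OUT,
        "mon01.pic.es": REFUSED,
    })
    for host, (now, later) in plan.items():
        print("%s: immediate=%s; longer term=%s" % (host, now, later))
```

A ROC could feed its own probe results into `action_plan` to produce a per-site to-do list.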

New CA version (ROC coordination)
CentralEurope egee.man.poznan.pl
Italy INFN-BOLOGNA-CMS
CERN TORONTO-LCG2
UKI csTCDie
UKI UKI-NORTHGRID-MAN-HEP
UKI cpDIASie
UKI mpUCDie

The following sites (seen as OK by the SFTs) did upgrade, but the update apparently did not go well for the new CAs (they probably ran apt-get upgrade). On these sites ca_fnal_KCA is still the old version:
SouthWesternEurope UAM-LCG2
Russia RU-Protvino-IHEP
SouthWesternEurope ifae
SouthWesternEurope LIP-LCG2

Special site:
AsiaPacific Taiwan-IPAS-LCG2: wrong (old?) ca_NorduGrid and ca_pkIRISGrid CAs

Some sites are seeing an APEL error in the SFTs although the APEL parser scripts run without problems. This issue is still being investigated. (AsiaPacific)

Middleware update announcements should be given at least a couple of days before the update becomes CT! We should have a couple of days to deal with it in the normal way. At our site there is a separate APT server which updates the worker nodes. Synchronization of that server with the repository is done manually and can be done only during working hours. (CentralEurope)

CIC Portal (DECH):
* Currently, the view of site reports (as shown for the ROC) does not seem to be very reliable (sites referencing vanished reports). Where are the weekly reports from sites actually shown now, after the portal update?
* Missing features: a view of RC reports, and who has filled in the report (a green/red check-marked list showing which sites are missing).
* Times in reports in the CIC Portal are still not coherent: "Is it possible that the failure time is wrong? Why does the all rgma failure start at about 2:30?"

Coordination of a solution for gridftp problems at Tier1s? (DECH) Quote: "See report above: several T1 sites experiencing problems with gridftp connections being dropped after some time running. Please ask for proper coordination to locate the cause of the problem."

Regarding the R-GMA update: if an urgent update requires reconfiguration of the released component, this should be clearly stated in the corresponding announcement. Also, if the update changes the version of the release, this should be clearly indicated to avoid confusion. (SouthEasternEurope)

Site admins in Romania complained last week that they were not recognized as site admins in the CIC portal. (SouthEasternEurope) Has this been reported through GGUS?

Communication problem: the Portuguese CA CRL was not updated at the CERN VOMS server. This prevented all LIP users from getting a proxy. A ticket was opened by the LIP people, and nothing happened for more than one week. Mails were also sent to Maria et al. Finally, after a mail was sent to the LCG-Rollout list, the problem was solved. This was a critical problem, since it prevented the users of an entire country from getting a VOMS proxy. The issue took too long to be addressed, and opening a GGUS ticket did not help. (SouthWest)

Announcement of updates and patches to the release (SouthWest): it would be good if each update announcement included a line specifying:
a) whether it applies to production or PPS;
b) whether the upgrade only requires installing new rpms or also needs some service configuration;
c) whether sysadmins are expected to take immediate action.
A "Priority" field could perhaps be added to the broadcast messages, and sending these mails to a dedicated list such as glite-announce might help. For critical and urgent patches, adding an SFT that fails sites which have not applied the patch could also be useful.
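
To make the request concrete, here is one way such an announcement header could be structured. This is only a sketch of the suggestion: the class `UpdateAnnouncement`, its field names and the priority values are hypothetical and do not describe any existing broadcast tool.

```python
from dataclasses import dataclass


@dataclass
class UpdateAnnouncement:
    """Hypothetical structure for a release update broadcast."""
    summary: str
    target: str                   # a) "production" or "pps"
    needs_reconfiguration: bool   # b) False means rpm installation only
    immediate_action: str = ""    # c) empty means none expected
    priority: str = "normal"      # proposed "Priority" field

    def render(self):
        """Return the header lines of the broadcast as plain text."""
        if self.target not in ("production", "pps"):
            raise ValueError("target must be 'production' or 'pps'")
        if self.needs_reconfiguration:
            change = "new rpms plus service reconfiguration"
        else:
            change = "new rpms only"
        lines = [
            "Subject: " + self.summary,
            "Priority: " + self.priority,
            "Applies to: " + self.target,
            "Change type: " + change,
            "Immediate action: " + (self.immediate_action or "none"),
        ]
        return "\n".join(lines)


if __name__ == "__main__":
    note = UpdateAnnouncement(
        summary="R-GMA security fix",
        target="production",
        needs_reconfiguration=True,
        immediate_action="stop tomcat",
        priority="urgent",
    )
    print(note.render())
```

Keeping the three items as mandatory fields means an announcement cannot be sent without answering them.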

Availability of a production gLite RB at CERN, replacing rb103 (LHCb): rb103.cern.ch, the gLite WMS that users got by default on lxplus, was far from being a production-quality service. The presentation of the tests done by LHCb on the gLite WMS (see http://indico.cern.ch/materialDisplay.py?contribId=17&sessionId=8&materialId=slides&confId=397, slide 17) shows that the results on this machine were far worse than those on other RBs located at CNAF. The RB has been retired from production (see GGUS tickets #9702 and #9706), but LHCb still sees at least two problems:
1. The machine has been removed from the list of "good" RBs, but the list is now empty, so there is no way to submit gLite jobs (for testing) through the CERN production service.
2. Before being advertised as a production-quality machine, an RB should go through a stronger certification process. Otherwise the risk is a repeat of LHCb's recent experience: an overly generic (and unfair) evaluation of the new gLite WMS middleware, when the problems were mainly due to wrong configuration and settings of the service itself.

ATLAS observed many instabilities in the information system over the last 2 weeks. A lot of jobs piled up at French sites 2 weeks ago, most likely because the BDII did not contain the most up-to-date information. Also, the number of CPUs reported by the BDII varies every 2 minutes (the BDII refresh rate). A simple monitoring tool that queries the BDII every so many seconds and keeps a log of quantities such as the number of computing elements, the number of SEs and the number of CPUs would help at least to understand how severe the instability is.
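
A minimal sketch of such a monitoring tool follows. It makes several assumptions: the BDII output is available as LDIF text (for example from an `ldapsearch` query against the BDII, which is not performed here), CEs and SEs are recognised by the GLUE attributes `GlueCEUniqueID` and `GlueSEUniqueID`, and CPUs are taken from `GlueCEInfoTotalCPUs`. The function names `summarize_ldif` and `poll` are hypothetical, and folded LDIF lines are ignored for brevity.

```python
import time
from datetime import datetime, timezone


def summarize_ldif(ldif_text):
    """Count CEs, SEs and CPUs in a block of LDIF text.

    Entries are assumed to be separated by blank lines. The CPU total is a
    plain sum over CE entries, so it may count a cluster more than once if
    several CEs publish the same cluster.
    """
    counts = {"ces": 0, "ses": 0, "cpus": 0}
    for entry in ldif_text.strip().split("\n\n"):
        attrs = {}
        for line in entry.splitlines():
            # Skip comments and folded continuation lines.
            if line.startswith((" ", "#")):
                continue
            name, sep, value = line.partition(":")
            if sep:
                attrs.setdefault(name.strip(), value.strip())
        if "GlueCEUniqueID" in attrs:
            counts["ces"] += 1
            cpus = attrs.get("GlueCEInfoTotalCPUs", "0")
            counts["cpus"] += int(cpus) if cpus.isdigit() else 0
        elif "GlueSEUniqueID" in attrs:
            counts["ses"] += 1
    return counts


def poll(fetch, interval_s, samples, log, sleep=time.sleep):
    """Call fetch() `samples` times, `interval_s` seconds apart.

    fetch: callable returning the current BDII content as LDIF text.
    log:   list that receives (timestamp, ces, ses, cpus) tuples.
    """
    for i in range(samples):
        counts = summarize_ldif(fetch())
        stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
        log.append((stamp, counts["ces"], counts["ses"], counts["cpus"]))
        if i + 1 < samples:
            sleep(interval_s)
    return log
```

Polling faster than the 2-minute refresh rate and comparing consecutive log rows would show how often, and by how much, the published totals jump.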

Review of action items

Following a few requests, we'll add a "due date" column for all actions.

Upcoming SC4 Activities

See new and updated information at https://twiki.cern.ch/twiki/bin/view/LCG/SC4ExperimentPlans

a) ALICE: The most important issue for ALICE to discuss with all the sites during this meeting is the condition of the FTS endpoints and SRM SEs

AOB

a) Next fixes/updates

Fix for GFAL info system timeout too low: https://savannah.cern.ch/bugs/index.php?func=detailitem&item_id=17738
b) Change of CA apt repository

Summary: the CA APT repository will not be changed until the default one distributed with YAIM (in site-info.def) is changed: http://savannah.cern.ch/bugs/?func=detailitem&item_id=17616 Expected timing: in 1-2 weeks
