WLCG-OSG-EGEE Operations meeting

Name: WLCG-OSG-EGEE Operations meeting
Start: 2006-07-24T14:00:00+02:00
End: 2006-07-24T17:30:00+02:00
Location: VRVS (Subway room)

Monday 24 Jul 2006, 14:00 → 17:30 Europe/Zurich

28-R-15 (VRVS (Subway room))

Nick Thackray

Description

VRVS "Subway" room will be available 15:30 until 18:00 CET

- 14:00 → 17:20
  28-R-15
  - 16:00
    
    Feedback on last meeting's minutes 5m
    
    Minutes
  - 16:05
    
    Grid-Operator-on-Duty handover 5m

    From UKI ROC (backup: France ROC) to Central Europe ROC (backup: Taiwan ROC)

16:10

SC4 weekly report and upcoming activities 10m

See new and updated information at https://twiki.cern.ch/twiki/bin/view/LCG/SC4ExperimentPlans

16:20

LCG job reliability 15m

Speakers: Massimo Lamanna, Pablo Saiz

16:35

gLite versioning for updates and releases 5m

The gLite 3.0 release process involves continuous updates, in a similar fashion to a Linux distribution. The current setup does not envision the version of gLite changing with these updates.

The internal certification process involves batching updates. To help maintain package lists, these batches are tagged 3.0.1, 3.0.2, etc.

Point of confusion: the glite-version rpm does track this internal tag. This is an artefact of the build system and was not intended.

Current plan: have glite-version report 3.0.0 across updates and ensure that services can have their versions queried.

For this to happen we need information providers that advertise the service version to the IS, and versioned meta-rpms whose dependencies track updates. The question 'what version of the CE do you have?' can then be answered either by checking the information system or by querying the version of the glite-CE rpm.

Currently pre-prod and production are identical, corresponding to a single set of updates tagged 3.0.1. A new set, internally tagged 3.0.2, is in certification.
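As a rough illustration of the second route described above (querying the version of a meta-rpm), the sketch below splits a package string of the form `name-version-release` into its parts. It assumes the query output arrives in that form; the helper names `parse_nvr` and `version_tuple` and the sample string `glite-CE-3.0.2-1` are hypothetical and do not describe an actual release.

```python
from typing import NamedTuple, Tuple


class PackageVersion(NamedTuple):
    name: str
    version: str
    release: str


def parse_nvr(nvr: str) -> PackageVersion:
    """Split a 'name-version-release' string into its three parts.

    The name may itself contain hyphens (e.g. 'glite-CE'), so the string is
    split from the right: the last field is the release, the one before it
    is the version, and everything else is the name.
    """
    name, version, release = nvr.strip().rsplit("-", 2)
    return PackageVersion(name, version, release)


def version_tuple(version: str) -> Tuple[int, ...]:
    """Turn a purely numeric dotted version such as '3.0.2' into (3, 0, 2)."""
    return tuple(int(part) for part in version.split("."))


# Hypothetical sample string, for illustration only.
ce = parse_nvr("glite-CE-3.0.2-1")
is_newer_than_first_batch = version_tuple(ce.version) > version_tuple("3.0.1")
```

Comparing versions as integer tuples rather than as strings avoids ordering mistakes such as "3.0.10" sorting before "3.0.2".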

Speaker: Oliver Keeble

16:40

Issues to discuss from reports 20m

Reports were not received from
ROCs:
Tier-1s: BNL, FNAL, NDGF, TRIUMF
VOs: Atlas

1. As this pre-report shows, several errors this week were caused by the SFT scripts themselves. We suggest that SFT errors caused by faults in the SFT scripts be excluded from pre-reports. Otherwise, an RC that is not aware of the SFT script issues will waste time investigating them. Is anybody doing pre-report filtering? (CentralEurope)

2. SFTs are launched on nodes which are declared as "not to be monitored" in the GOC DB. As a result, the RC report is full of false failures. Do we have to remove from the GOC DB the nodes that are not currently in production? We generally keep several nodes so that we can switch from one machine to another when upgrading the m/w, which is why some nodes are set as not monitored in the GOC DB. It would certainly be better to filter the nodes to be monitored, as was done with the previous version of the SFTs. (France)

3. See GGUS #8700 and GGUS #10501. It is impossible to get a clear list of rpms per service. Currently the m/w is provided per "node", by way of the meta-packages. But a node is a list of services (optional, relocatable, or not). When you use Quattor you need to know the rpms per service rather than per node. Moreover, decomposing a node into services makes the m/w easier to understand and therefore easier to operate. (France)

4. Java (j2sdk) has been newly included in the MW (or is it the OS?) repository. It previously required extra configuration that has not been included in the j2sdk rpm: I personally use the alternatives system and other symlinks, which had to be updated after the old version of the j2sdk was removed. This explains the RGMASC test failing on 2006-07-14 (not reported in this page) and the csh test failing (the SFT ran while fixes for the j2sdk changes were being applied). (France)

5. SFT times in CIC reports do not seem to be consistent: some are in local time, some in GMT. The same applies to the SFT history: times for SFTs initiated by the AEGIS site admin are published in GMT, while for the official SFTs some times are local and some are GMT. I would like to request that a decision is made on this (preferably times in GMT), that all parties involved are informed (SFT developers, ROCs, sites), and that all sources are updated (CVS for SFT, tools that initiate SFTs, etc.) so that no confusion remains. (SouthEastEurope)
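One way to act on item 5 would be to convert every published SFT time to GMT/UTC before it is displayed. The following is a minimal sketch, assuming timestamps arrive as naive `YYYY-MM-DD HH:MM` strings together with a known UTC offset for the machine that recorded them. The function name `to_utc` and the input format are illustrative assumptions, not part of the SFT tools.

```python
from datetime import datetime, timedelta, timezone


def to_utc(timestamp: str, utc_offset_hours: float) -> datetime:
    """Interpret a naive 'YYYY-MM-DD HH:MM' timestamp recorded at the given
    UTC offset and return the same instant as a timezone-aware UTC datetime."""
    source_zone = timezone(timedelta(hours=utc_offset_hours))
    recorded = datetime.strptime(timestamp, "%Y-%m-%d %H:%M").replace(tzinfo=source_zone)
    return recorded.astimezone(timezone.utc)


# A time recorded at UTC+2 and the same instant already recorded in GMT
# compare equal once both are normalised.
local_entry = to_utc("2006-07-24 14:00", 2)
gmt_entry = to_utc("2006-07-24 12:00", 0)
same_instant = local_entry == gmt_entry
```

Storing the offset alongside each source, rather than guessing it at display time, keeps the conversion unambiguous.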

6. Several sft-rgma failures are seen simultaneously on many sites, which can only be attributed to a temporary problem with the central registry. Is it possible to improve sft-rgma so that, prior to submitting SFTs, it checks whether the registry is available and running (e.g. from a UI at CERN, by some simple query), and if not, sft-rgma is dropped from the list of tests sent to sites? This would reduce the number of failures recorded against sites for problems the sites are not actually responsible for. (SouthEastEurope)
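The pre-check proposed in item 6 could look roughly like the sketch below. It is an assumption-laden illustration: `select_tests` and the `registry_is_up` probe are hypothetical names, the probe stands in for whatever "simple query" would be chosen, and the test names other than sft-rgma are placeholders.

```python
from typing import Callable, List


def select_tests(tests: List[str],
                 registry_is_up: Callable[[], bool],
                 dependent_test: str = "sft-rgma") -> List[str]:
    """Return the tests to submit, dropping `dependent_test` when the probe
    reports the central registry as unavailable or the probe itself fails."""
    if dependent_test not in tests:
        return list(tests)
    try:
        available = registry_is_up()
    except Exception:
        # A probe that raises is treated the same as a registry that is down.
        available = False
    if available:
        return list(tests)
    return [test for test in tests if test != dependent_test]


# Placeholder test names; only "sft-rgma" comes from the report above.
planned = ["test-a", "sft-rgma", "test-b"]
to_submit = select_tests(planned, registry_is_up=lambda: False)
```

Passing the probe in as a callable keeps the selection logic independent of how the registry is actually queried.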

7. When will the gLite SAM tests be available for the OPS VO? (SouthWestEurope)

8. ALICE: Alice is following the production, and all problems coming from the different sites are being reported to the corresponding experts. The scheduled transfers are continuing in parallel. There are issues with all sites except CNAF, which seems to have the best performance. RAL is not providing SRM endpoints for Alice due to a lack of resources. For the rest of the sites, we are still hoping to see a successful transfer and are following the results with Pablo Saiz.

9. CMS: CMS has started production for CSA06. We have experienced a few problems staging the files to local storage. This has typically been related to UNIX permissions in the directories written by SRM or problems with the local CMS configuration. We have been working on the situation site by site. We are working on loading a new version of the CMS software to the LCG sites. Production is still being shaken out, but we expect to switch to more regular job submission this week.

10. LHCb:
Point A.
Current situation before pushing the button for DC06.
CNAF: problem accessing the storage
GRIDKA: Maradona problem preventing reconstruction jobs for DC06 from running (pilot jobs aborting suddenly without consuming CPU). Under investigation directly with the site admins and the LHCb representative
PIC: OK
RAL: OK (but slow access to the storage), under investigation
CERN: OK
Lyon: dcap protocol support moved to gsidcap. Has it been announced? They are working with a home-made solution. lcg-gt proved extremely slow.
Point B.
Still the old problem of sites in Scheduled Downtime (SD). Jobs should not be steered to them. FCR is too restrictive because LHCb is using of the order of ten RBs and not all of them are FCR-aware. Wouldn't it make more sense to automatically remove sites in SD from production?
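The automatic removal asked about in Point B amounts to filtering the site list against the scheduled downtime windows at submission time. The sketch below shows that idea only; the function name, the data layout, and the site names (`SITE-A`, `SITE-B`, `SITE-C`) are all hypothetical, and it says nothing about how FCR or the RBs would obtain the downtime information.

```python
from datetime import datetime
from typing import Dict, List, Tuple

# (start, end) of one scheduled downtime; the end is treated as exclusive.
Downtime = Tuple[datetime, datetime]


def sites_in_production(sites: List[str],
                        downtimes: Dict[str, List[Downtime]],
                        now: datetime) -> List[str]:
    """Return the sites that are not inside any scheduled downtime at `now`."""
    def is_down(site: str) -> bool:
        return any(start <= now < end for start, end in downtimes.get(site, []))

    return [site for site in sites if not is_down(site)]


# Hypothetical example: SITE-B is in a downtime that covers `now`,
# SITE-C had a downtime that has already ended.
now = datetime(2006, 7, 24, 16, 0)
downtimes = {
    "SITE-B": [(datetime(2006, 7, 24, 8, 0), datetime(2006, 7, 24, 20, 0))],
    "SITE-C": [(datetime(2006, 7, 20, 8, 0), datetime(2006, 7, 21, 8, 0))],
}
eligible = sites_in_production(["SITE-A", "SITE-B", "SITE-C"], downtimes, now)
```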

17:00

Review of action items 15m

17:15

AOB 5m
