WLCG-OSG-EGEE Operations meeting

Monday 31 Jul 2006, 14:00 → 17:30 Europe/Zurich

28-R-15 (VRVS (Saturn room))

Nick Thackray

Description

VRVS "Saturn" room will be available 15:30 until 18:00 CET

- 14:00 → 17:25
  28-R-15
  - 16:00
    
    Feedback on last meeting's minutes 5m
    
    Minutes
  - 16:05
    
    Grid-Operator-on-Duty handover 5m
    From Central Europe (backup: Taiwan) to South Eastern Europe (backup: Italy)

16:10

SC4 weekly report and upcoming activities 10m

See new and updated information at https://twiki.cern.ch/twiki/bin/view/LCG/SC4ExperimentPlans

Speaker: Harry Renshall

16:20

Information system instabilities 5m

It has been noticed that some information periodically disappears from the information system. This is caused by the time-outs at the various levels of the hierarchy: there is a time-out when the top-level BDII queries the site-level BDII, and another when the site-level BDII queries the GRIS and the information providers. The time-outs are there for a reason: they protect the system from queries that take far longer to return than they would under normal circumstances.
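To illustrate how these time-outs make entries vanish, here is a minimal Python sketch. It is a toy model, not BDII code: the `Site` and `Provider` classes, the response times and the time-out values are all hypothetical, and real LDAP queries are replaced by precomputed response times.

```python
from dataclasses import dataclass, field


@dataclass
class Provider:
    service: str
    response_s: float  # hypothetical time this information provider takes to answer


@dataclass
class Site:
    name: str
    bdii_response_s: float  # hypothetical time the site-level BDII takes to answer the top level
    providers: list[Provider] = field(default_factory=list)


def site_level_query(site: Site, site_timeout_s: float) -> set[str]:
    """Services the site-level BDII publishes: providers slower than the time-out are dropped."""
    return {p.service for p in site.providers if p.response_s <= site_timeout_s}


def top_level_query(sites: list[Site], top_timeout_s: float,
                    site_timeout_s: float) -> dict[str, set[str]]:
    """Snapshot seen at the top-level BDII: slow site-level BDIIs are dropped as a whole."""
    snapshot: dict[str, set[str]] = {}
    for site in sites:
        if site.bdii_response_s > top_timeout_s:
            continue  # the entire site vanishes from this snapshot
        snapshot[site.name] = site_level_query(site, site_timeout_s)
    return snapshot


if __name__ == "__main__":
    # Hypothetical sites and timings, chosen only to trigger each case.
    sites = [
        Site("SITE-A", 2.0, [Provider("CE", 1.0), Provider("SE", 1.5)]),
        Site("SITE-B", 45.0, [Provider("CE", 1.0)]),                       # slow site-level BDII
        Site("SITE-C", 3.0, [Provider("CE", 1.0), Provider("SE", 40.0)]),  # slow SE provider
    ]
    print(top_level_query(sites, top_timeout_s=30.0, site_timeout_s=30.0))
```

With these numbers SITE-B is missing entirely and SITE-C publishes only its CE. The model treats the two levels independently; in reality a slow provider also adds to the site-level BDII's own response time.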
We need to improve the monitoring of the information in the information system so that we can spot this kind of problem. As a rule of thumb, if a whole site disappears from the information system, it is a time-out while querying the site-level BDII. If only one service disappears, it is usually an information provider. Any disappearing information needs to be investigated, as it usually points to low-level, fabric-related problems.
FZK seems to be the worst site at the moment.
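The rule of thumb could be applied automatically by comparing two successive snapshots of what is published. The following Python sketch assumes a snapshot is a hypothetical mapping from site name to the set of services it publishes; how such snapshots are collected from a real top-level BDII is not shown.

```python
from typing import Optional

Snapshot = dict[str, set[str]]  # hypothetical: site name -> published services


def classify_disappearances(previous: Snapshot,
                            current: Snapshot) -> list[tuple[str, Optional[str], str]]:
    """Apply the rule of thumb to entries present before but missing now."""
    findings: list[tuple[str, Optional[str], str]] = []
    for site in sorted(previous):
        if site not in current:
            # Whole site gone: suspect a time-out while querying the site-level BDII.
            findings.append((site, None, "site-level BDII time-out"))
            continue
        for service in sorted(previous[site] - current[site]):
            # Single service gone: suspect its information provider.
            findings.append((site, service, "information provider"))
    return findings


if __name__ == "__main__":
    before = {"SITE-A": {"CE", "SE"}, "SITE-B": {"CE"}, "SITE-C": {"CE", "SE"}}
    after = {"SITE-A": {"CE", "SE"}, "SITE-C": {"CE"}}
    for site, service, suspect in classify_disappearances(before, after):
        print(site, service or "(whole site)", "->", suspect)
    # SITE-B (whole site) -> site-level BDII time-out
    # SITE-C SE -> information provider
```

The classification only says where to look first; as noted above, the underlying cause is usually a fabric problem that still needs investigating.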

Speaker: Laurence Field

16:25

New set of updates to gLite 3.0 5m

A new set of updates to gLite 3.0 has passed certification. It contains many bug fixes and adds Glue 1.2 support to the gLite WMS. The updates will be released to Pre-Production later today. After one week, and depending on the results, they will be released to production.

16:30

Issues to discuss from reports 25m

Reports were not received from:
- ROCs: Russia
- Tier-1: BNL, FNAL, NDGF, TRIUMF
- VOs: ATLAS, LHCb, Biomed

1. Central Europe: Questions regarding gLite versioning for updates and releases, discussed at the last meeting:
- will there be "major" updates (i.e. big updates of all software components) aside from those "continuous/small" updates?
- do you plan to use certification at ROCs for those "small" updates? We'd be eager to take part in that.

2. DECH: Reports from some (how many?) sites are not visible in the ROC View of the reports, although they can be consulted through the Site View in the portal. This report is therefore most probably incomplete!

3. DECH: Is there an updated plan for the migration to SL4? (Will there be a central approach to updating the OS? When will the middleware be ready for SL4?)

4. DECH: Information from the experiments' data challenges, including dteam transfers, has been strongly reduced. What is going on?

5. DECH: GSI is developing a ROOT-gLite interface. Because the JRA1 website http://egee-jra1-dm.web.cern.ch/egee%2Djra1%2Ddm/ had not been updated, we spent some weeks developing against the gLite I/O server and the Fireman catalogue. Only recently did we learn that those services will most probably be replaced by the GFAL library and the LFC.

6. Italy: Support for the Classic SE will be stopped in autumn. Many sites installed the Classic SE and the MON box on the same machine, but it is not possible to install DPM and the MON box together because R-GMA and DPM both use port 8443. In general, different services should run on different ports to avoid such conflicts, or it should be easy to change their default ports. At least for small sites, it is very important to be able to run different grid elements on a single machine.
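Such clashes can be caught before installation. A minimal Python sketch, assuming a hand-written table of the ports each planned service listens on (the table and both helper functions are hypothetical, not part of any gLite tool; only the 8443 clash comes from the report): `find_port_conflicts` flags ports claimed by more than one planned service, and `port_in_use` checks whether something on the host already holds a port.

```python
import errno
import socket
from collections import defaultdict


def find_port_conflicts(planned: dict[str, list[int]]) -> dict[int, list[str]]:
    """Return each port claimed by more than one planned service."""
    claims: dict[int, list[str]] = defaultdict(list)
    for service, ports in planned.items():
        for port in ports:
            claims[port].append(service)
    return {port: names for port, names in claims.items() if len(names) > 1}


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """True if binding host:port fails because the address is already taken."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError as exc:
            if exc.errno == errno.EADDRINUSE:
                return True
            raise  # other errors (e.g. permissions) are not "in use"
        return False


if __name__ == "__main__":
    # Hypothetical plan for one machine; 8443 is the clash reported above.
    planned = {"DPM": [8443], "R-GMA (MON box)": [8443]}
    print(find_port_conflicts(planned))  # {8443: ['DPM', 'R-GMA (MON box)']}
    print("8443 in use:", port_in_use(8443))
```

The static check needs the real port list for each service as input; the bind test only tells you about what is running right now on that host.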

7. UKI: As a ROC manager, I check the remarks made by sites before validating the report. Although the comments are listed by site, would it be possible for the author's CN to be stamped against each comment? In any case, I will ask all UK staff to add their name at the bottom of each future report. This should make following up issues easier.

8. US: Location of the OSG site Scheduled Downtime web page for SAM

VO REPORTS:

ALICE: I will report on T0-T1 transfers only, since this is the most important topic at the moment:
1. CERN-CCIN2P3: working
2. CERN-CNAF: still not working because of CASTOR problems at the site
3. CERN-FZK: SRM is not properly configured on the VOBOX; still waiting for this configuration. The site is aware of the problem.
4. CERN-RAL: resources and SRM endpoints not provided
5. CERN-SARA: not yet tested

CMS: CMS began official production last week. There have been several issues with the local site configuration of experiment-specific items, for example the location in the storage element where the site expects CMS to stage files out locally. We are working with individual sites and there is good progress.
A power cut had a significant impact on central experiment services. We will work within the CMS LCG task force to revisit the placement of experiment-critical central systems.

16:35

Review of action items 15m

17:20

AOB 5m
