EGEE-WLCG-OSG operations meeting

5th November 2007

Agenda

The agenda can be found here: http://indico.cern.ch/conferenceDisplay.py?confId=23372

 

Attendance:

OSG grid operations:........ Rob Quick

EGEE

Asia Pacific ROC:............. (apologies)

Central Europe ROC:........ Marcin Radecki

OCC / CERN ROC:........... John Shade, Dusan Vudragovic

French ROC:.................... Gilles Mathieu, Pierre Girard, Helene Cordier

German/Swiss ROC:......... Sven Herrmann, Helmut Dress

Italian ROC:...................... Alessandro Cavalli, Alessandro Paolini

Northern Europe ROC:...... Jules Wolfrat

Russian ROC:................... Lev Shamardin

South East Europe ROC:.. Kostas Koumantaros

South West Europe ROC:. Gonzalo Merino

UK/Ireland ROC:............... Jeremy Coles

GGUS:............................. Helmut Dress

OSCT:..............................

WLCG

WLCG service coord.........

WLCG Tier 1 Sites

ASGC:............................. (apologies)

BNL:................................ Absent

CERN site:....................... John Shade

FNAL:.............................. Joe Kaiser

FZK:................................ Sven Hermann

IN2P3:.............................. Pierre Girard

INFN:............................... Alessandro Cavalli, Alessandro Paolini

NDGF:.............................

PIC:................................. Gonzalo Merino

RAL:................................ Jeremy Coles

Sara/NIKHEF:................... Jules Wolfrat

TRIUMF:...........................

VOs

Alice:...............................

ATLAS:............................ Graeme Stewart, Alessandro Di Girolamo

BioMed............................

CMS:...............................

LHCb:..............................

 

Reports were not received from:

VOs:...................................... Alice, CMS, BioMed , LHCb

-  EGEE ROCs (prod sites):.. RU, SEE, AP (apologies)

- EGEE ROCs (PPS sites):.. AP (apologies), IT, RU, SWE, SEE

Feedback on last meeting's minutes

No comments during the meeting

EGEE Items

Grid-Operator-on-Duty handover

From: UK/Ireland / SouthWesternEurope

To: CentralEurope / Italy

 

Issues:

No particular issues to report for this hand-over

 

PPS reports

Extract from agenda:

 

Issues from EGEE ROCs:

1. Nothing to report

Release News:

1. gLite 3.1.0 PPS Update08: pre-deployment tests passed and deployed to the remaining PPS sites. The release contains (patch numbers):
- 1233 R3.1 FTS update (glite-data_R_3_1_35_1)
- 1255 JobWrapper tests - new version with no R-GMA dependencies
- 1381 New version of lcg-tags with better error reporting
- 1382 New version of lcg-info with support for VOViews, sites and services
- 1383 lcg-CE for gLite 3.1
- 1384 Updated Torque (2.1.9-4) and Maui (3.2.6p19-4)
- 1393 gLite 3.1 TORQUE_utils (slc4/ia32)
- 1394 gLite 3.1 TORQUE_server (slc4/ia32)
- 1413 glite-yaim-core 4.0.1 for the 3.1 repository
- 1415 glite-yaim-clients 4.0.1 for the 3.1 repository

 

Gonzalo (SWE): Is there a strategy for the roll-out of the SLC4 WMS in production?

Antonio: Not yet; we take the action of starting to draw up a plan.

EGEE issues coming from ROC reports

1. (NE - site RUG): a. I finally managed to solve the problem with the missing accounting information. I found out that our site had not been added to the ACL of the central R-GMA repository. Apparently this ACL is not synchronised with the information in GOCDB. I think this is problematic; at the very least, it took me a lot of work to find out that this was the problem.
Question from OCC to GOC: should the registration of new sites in the R-GMA registry be a documented step in the ROCs' registration procedures?
b. Most of the work related to the issue above was caused by poor handling of the ticket. It was not possible for me to keep the problem in GGUS, which should have been possible. See the ticket for further details: https://gus.fzk.de/ws/ticket_info.php?ticket=27104
Reply from GGUS: the current configuration of the GGUS system would indeed have allowed this particular ticket to be solved in a different way by the GGUS "standard" support units. Specifically, with reference to the action of 8th October, the UK ROC could have contacted the local support mailing list instead of asking the site admin to do it. On the other hand, GGUS is not supposed to replace existing support infrastructures, but rather to coordinate the support effort. Calls to local helpdesks are not forbidden by the process (and sometimes they may even accelerate the resolution of issues). Part of the added value of GGUS is to make sure that once a problem ticket is opened within GGUS it is actually followed up by supporters at all levels and brought to a resolution in a reasonable time, independently of any SLAs which may or may not be set up at the level of the local helpdesks.
A good overview of the GGUS process, with particular stress on the interaction with the local helpdesks (slides 4 and 8), can be found at
http://indico.cern.ch/contributionDisplay.py?contribId=76&sessionId=24&confId=3580
Any suggestions to improve the process are of course welcome. Suggestions and recommendations can be posted at
https://savannah.cern.ch/support/?func=additem&group=esc

Pierre (France): I do not agree with this reply. In my opinion, local support should be integrated with GGUS in a way that allows all the work being done at the different levels to be tracked.

Jeremy (UK): However, the local support system of ROC UKI is strongly integrated with GGUS, so updates are propagated in both directions.

2. (NE - site SARA) GGUS ticket 18826 is assigned to SAM/SFT support team and confirmed on April 16, but nothing seems to happen, only from time to time the site is asked if the problem still exists, see: https://gus.fzk.de/pages/ticket_details.php?ticket=18826

Comment (SAM): Maarten Litmaath is working on this issue.

3. (CERN - site CERN-PROD): the command
glite-wms-job-status -all
meant to retrieve all of a user's jobs that are in a certain status, is not working at CERN-PROD. As a follow-up to a GGUS ticket (https://gus.fzk.de/ws/ticket_info.php?ticket=27455), the WMS service admins replied by opening a bug in Savannah (https://savannah.cern.ch/bugs/?30989), saying:

...
In Chapter 6.3.2 'Job Operations', the 'gLite 3 User Guide' explains that for the -all option to work, an index by owner must be created in the LB server; otherwise the command will fail, since the LB server will not be able to identify the user's jobs. Such an index can only be created by the LB server administrator, as explained in http://server11.infn.it/workload-grid/docs/DataGrid-01-TEN-0118-1_2.pdf However, it seems that this feature is not enabled by yaim. We don't want to do it by hand. If we allow users to use this option, it should be configured automatically by yaim (for each user?!?).
...
I really think that the --all option is nasty, since it queries the whole database and could slow down the performance of the LB node if several users use it at the same time. It would be nice if the developers could comment and decide what to do with the --all option.

Question from OCC to site admins: Has anyone ever experienced overload of the LB database due to concurrent use of the command glite-wms-job-status -all?
Question from OCC to HEP VOs: Do any of the HEP VOs plan to make extensive use of this feature?

ATLAS and CMS replied (offline) that they have no plans to use this feature.

4. (CE - IRB) ATLAS VOView problem. There are sites from CE listed as failing for which the reason is not clear, e.g. egee.irb.hr. Could the ATLAS VO provide the source code of the script which generates the list of failing sites? That would help us understand what the problem with these sites is.
Reply (Simone Campana from Atlas):
The code is in
/afs/cern.ch/user/c/campanas/public/VOVIEW
There is a .txt file with a query you should run on the BDII to gather the relevant info. In addition, there is a Python script which fetches info from the output of the query and generates an output in which sites marked with ==> are the problematic ones.
I re-ran the query and published the results at
http://voatlas01.cern.ch/atlas/data/VOViewProblem.log
The site mentioned above is not in the list, so it looks fine. There are currently 65 queues with problems. It would be nice if some action could be taken; the situation has been persisting for more than a month.

5. (CE - CYFRONET) CIC SD notification. It seems that some downtime notifications are wrongly addressed; e.g. admins of the CYFRONET-IA64 site received a notification about a T2_Estonia downtime. It would be helpful if the tool explained why it considers the recipient "relevant" for the message. That would help to find out whether the message was wrongly addressed by the SD submitter or whether this is a bug in the SD web form.
Comment (OCC): Question moved to the CIC Portal. The point is addressed later in this agenda.

6. (CE) Comment from CE sites about the expiration of an SL3 service after the SL4 version is released. SA3 proposed a period of one month, which seems reasonable provided there are no issues with the SL4 service that prevent migration. We suggest a procedure in which sites can raise issues that prevent them from migrating to the SL4 version of the service, extending the SL3 expiration date until the issues with SL4 are solved. The procedure could be as follows:

      1. SA3 announces the expiration of the SL3 service version
      2. sites have some time, e.g. one week, to raise issues with SL4
      3. if there are no issues, the expiration is accepted and the SL3 version expires in one month
      4. if there were issues, they need to be addressed by SA3 and the expiration of SL3 is extended

Comment (OCC): Suggestion forwarded to SA3. SA3 accepts the suggestion and agrees with it in general.

Marcin (CE): The main point is to give sites the possibility to raise issues with SL4. Details will come out later.

7. (UKI - RAL T1) There have been a few recent cases where the gstat page shows a site in maintenance while GOCDB does not have any downtime listed (for one example, see the RAL Tier-1 case in GGUS ticket 28520).
Reply (gstat): There is a problem with the maintenance module of gstat. We have disabled it for now so we can fix it and put it back into production. This stems from a problem with the way we formulate the SQL query to GOCDB.

8. (UKI) Several sites have seen recent SAM sft-job failures related to downloading from RBs. Errors like "Cannot download X from gsiftp://rb115.cern.ch", where X is usually .BrokerInfo or the tarball of SAM tests, are being seen. Is this evident in other EGEE regions, and what is behind it?

Kostas: We will check among our recent failures

9. (IT - CNAF) Error with FTA_GLOBAL_DB_PASSWORD in site-info.def. A (very) possible reason is a bug in yaim that doesn't change the password in /etc/tomcat5/Catalina/localhost/glite-data-transfer-fts.xml according to the variable FTA_GLOBAL_DB_PASSWORD.
Question (OCC): Which version of FTS? Was a bug reported for this?
(IT - CNAF) answer: only reported to the FTS developers so far. A bug will be opened no later than tomorrow.

10. (IT - CNAF) SAM tests: we had a configuration problem with a CE (ce05-lcg.cr.cnaf.infn.it); some SAM tests failed, and FCR (Freedom of Choice for Resources, used by CMS) removed the CE from the BDII. Even though we fixed the CE quickly, it was considered down for many hours: it seems that the "JS" result took too long to be published correctly after all the single tests were successful. Because of this, the old JS ERROR status was still in effect, causing FCR to exclude our CE even though everything was OK. Is there a possible workaround for this? Has this already been raised by another site/ROC?

Antonio: Please submit a GGUS ticket for that.

Additional points and precisions about the new SD procedure

Gilles (CIC Portal): The new version of the CIC portal automates the notifications sent when a downtime is scheduled. The request is to state why the recipients are being notified about an SD. A discussion is ongoing with the GOCDB team about the definition of the core node level: when a site goes into scheduled/unscheduled downtime, the existence of the site's core service nodes should be checked and notifications should be sent accordingly.
If a service is an essential core service then all ROCs should be notified, but if it is essential only for a specific VO or ROC, only the specific people should be notified.
We need to distinguish between a core node (which already exists in GOCDB) and a core service (which does not exist in GOCDB yet). The final aim is to automate the notifications as much as possible in order to reduce the risk of human error in selecting targets for the broadcasts.

Antonio: Sometimes a downtime could be of interest to particular ad-hoc targets, so maybe further discussion is needed about whether some room for human intervention should be left in the system.

SAM: new security test in Validation

What the test does is:

1. It reads all the environment variables in the WN account where the job runs.

2. For each directory/file specified in any variable, and for each file inside any of those directories:

- the test checks whether the file/dir has write privileges for the "other" group (the --------w- bit);

- if such a file/dir is in $PATH, it returns an ERROR;

- if such a file/dir is not in $PATH, it returns a WARNING;

- if no 'w' privilege was found in any of the files/dirs, the test returns OK.
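The check described above can be sketched roughly as follows. This is a minimal illustration of the described logic only, not the actual SAM test code; the function name and the exact traversal details are assumptions.

```python
import os
import stat

def check_other_writable(environ):
    """Scan paths named in environment variables for files/dirs that
    are writable by 'other', mimicking the test description above.
    Returns 'ERROR', 'WARNING', or 'OK'."""
    path_dirs = environ.get("PATH", "").split(os.pathsep)
    status = "OK"
    for value in environ.values():
        for entry in value.split(os.pathsep):
            if not (os.path.isdir(entry) or os.path.isfile(entry)):
                continue  # the variable did not name an existing path
            candidates = [entry]
            if os.path.isdir(entry):
                try:
                    candidates += [os.path.join(entry, f)
                                   for f in os.listdir(entry)]
                except OSError:
                    continue  # unreadable directory, skip
            for path in candidates:
                try:
                    mode = os.stat(path).st_mode
                except OSError:
                    continue
                if mode & stat.S_IWOTH:          # 'w' bit for "other"
                    if entry in path_dirs:
                        return "ERROR"           # other-writable and on $PATH
                    status = "WARNING"           # other-writable, off $PATH
    return status
```

For example, a world-writable scratch directory referenced by a job environment variable would yield WARNING, and the same directory listed in $PATH would yield ERROR.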

Validation instance of SAM portal is available here:

- https://lcg-sam-val.cern.ch:8443/sam-val/sam.py

Documentation is temporarily available at

- https://lxn1181.cern.ch:8443/sam-val/docs/CE-wn-sec-fp.html

In the near future it will be linked together with the rest of the tests in production:

      - http://grid.cyfronet.pl/sam-doc/masterindex.html

 

 

WLCG Items

WLCG issues coming from ROC reports

None this week.

Upcoming WLCG Service Interventions

1. [Announcement] FZK (Tier1 GridKa): Scheduled downtime for maintenance on November 6, 8:00-22:00 UTC (9:00-23:00 CET). Upgrade to dCache 1.8. All VOs using the GridKa SE are affected. Data transfers are stopped during this period.

FTS service review

See agenda for reports.

ATLAS service

Problem of VOView consistency. Progress was registered on action 67. See point 4 in the EGEE ROC reports above.

CMS service

No report. No representative present.

LHCb service

No report. No representative present.

ALICE service

No report. No representative present.

Service Challenge Coordination

  • WLCG Service Reliability workshop, CERN, November 26 - 30 - agenda - wiki
  • Common Computing Readiness Challenge - CCRC'08 - meetings page
  • ATLAS throughput tests have finished and M5 detector cosmics are now running until 5 November. Data export from CERN will follow later in the week.
  • CMS CSA07 will now continue until mid-November.

OSG Items

Rob: A change in VOMS (announced in several EGEE broadcasts) caused OSG a certain amount of distress; basically, a communication flaw between the EGEE broadcast tool and OSG was discovered. We ask for goc@opensciencegrid.org to be added to the targets of the EGEE broadcast.

Helene: No problem with that, but we first need to define how we are going to achieve it, in order to make sure that OSG actually gets the information it needs.

We need to identify OSG either as part of one of the current targets in their present form:

https://cic.gridops.org/index.php?section=roc&page=broadcast

(ROC, Tier1, etc. included)

or as part of a new target, e.g. "project".

We cannot let people decide whether or not to include the OSG target in the broadcast; otherwise we are likely to end up in another communication hole.

Rob: While waiting for the correct placement of OSG to be defined, I'd like the list to be added to all the communication flows anyway. I could filter the messages for some time as a temporary measure.

Maria: As far as the VOMS/VOMRS services are concerned, we are in any case going to add goc@opensciencegrid.org to our list of special targets (always in CC in our broadcasts).

Review of action items

The updated list of action items can be found attached to the agenda.

AOB

Graeme (ATLAS): On sites which have upgraded to SL(C)4 x86_64, ATLAS needs a 32-bit Python to be available so that jobs can access the (32-bit only) LFC plugin while bootstrapping. The only sensible way that has been found is to have, on 64-bit sites, the 32-bit version of Python callable as "python32". Sites which have upgraded to SL(C)4 i386 need not change anything in this regard. The twiki page explaining how to do this is here: https://twiki.cern.ch/twiki/bin/view/Atlas/RPMcompatSLC4 (in particular https://twiki.cern.ch/twiki/bin/view/Atlas/RPMcompatSLC4#Running_ATLAS_production_at_Grid ).
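As a rough illustration of the renaming approach, the "python32" name can be exposed via a symlink. This is a sketch only: the function name and all paths below are assumptions, and the twiki page above is the authoritative recipe.

```python
import os

def expose_python32(python32_binary, target):
    """On a 64-bit worker node, make a 32-bit Python build reachable
    under the agreed name 'python32' by creating a symlink.
    Both paths are illustrative; real locations are site-specific."""
    if not os.access(python32_binary, os.X_OK):
        raise FileNotFoundError(
            "no executable 32-bit python at %s" % python32_binary)
    if os.path.lexists(target):
        os.remove(target)  # replace a stale link or file
    os.symlink(python32_binary, target)
```

A site might call, for example, expose_python32("/opt/python-32bit/bin/python", "/usr/local/bin/python32"), where /opt/python-32bit is an assumed install location, not a standard path.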

Next Meeting

The next meeting will be Monday, 12th November 2007 14:00 UTC (16:00 Swiss local time).

Attendees can join from 13:45 UTC (15:45 Swiss local time) onwards. 

The meeting will start promptly at 14:00 UTC.

The WLCG section will start at the fixed time of 16:30.