EGEE-WLCG-OSG operations meeting
5th November 2007
Agenda
The agenda can be found here: http://indico.cern.ch/conferenceDisplay.py?confId=23372
Attendance:
OSG grid operations:........
Rob Quick
EGEE
OCC / CERN ROC:........... John Shade, Dusan Vudragovic
French ROC:.................... Gilles Mathieu, Pierre Girard, Helene Cordier
German/Swiss ROC:......... Sven Herrmann, Helmut Dress
Italian ROC:...................... Alessandro Cavalli, Alessandro Paolini
Russian ROC:................... Lev Shamardin
South
South
UK/Ireland ROC:............... Jeremy Coles
GGUS:............................. Helmut Dress
OSCT:..............................
WLCG
WLCG service coord.........
WLCG Tier 1 Sites
ASGC:............................. (apologies)
BNL:................................ Absent
CERN site:....................... John Shade
FNAL:.............................. Joe Kaiser
FZK:................................ Sven Hermann
IN2P3:.............................. Pierre Girard
INFN:............................... Alessandro Cavalli, Alessandro Paolini
NDGF:.............................
PIC:................................. Gonzalo Merino
RAL:................................ Jeremy Coles
Sara/NIKHEF:................... Jules Wolfrat
TRIUMF:...........................
VOs
Alice:...............................
ATLAS:............................ Graeme Stewart, Alessandro Di Girolamo
BioMed............................
CMS:...............................
LHCb:..............................
Reports were not received from:
- VOs:...................................... Alice, CMS, BioMed, LHCb
- EGEE ROCs (prod sites):.. RU, SEE, AP (apologies)
- EGEE ROCs (PPS sites):.. AP (apologies), IT, RU, SWE, SEE
Feedback on last meeting's minutes
No comments during the meeting
EGEE Items
Grid-Operator-on-Duty handover
From: UK/Ireland / SouthWesternEurope
To: CentralEurope / Italy
Issues:
No particular issues to report for this hand-over.
PPS reports
Extract from agenda:
Issues from EGEE ROCs:
1. Nothing to report
Release News:
1. gLite 3.1.0 PPS Update08: pre-deployment tests passed and the update was deployed to the remaining PPS sites. The release contains (patch numbers):
-o 1233 R3.1 FTS update (glite-data_R_3_1_35_1)
-o 1255 JobWrapper tests - new version with no R-GMA dependencies
-o 1381 New version of lcg-tags with better error reporting
-o 1382 New version of lcg-info with support for VOViews, sites and services
-o 1383 lcg-CE for glite 3.1
-o 1384 Updated Torque (2.1.9-4) and Maui (3.2.6p19-4)
-o 1393 gLite 3.1 TORQUE_utils (slc4/ia32)
-o 1394 gLite 3.1 TORQUE_server (slc4/ia32)
-o 1413 glite-yaim-core 4.0.1 for the 3.1 repository
-o 1415 glite-yaim-clients 4.0.1 for the 3.1 repository
Goncalo (SWE): Is there a strategy for the roll-out of the SLC4 WMS in production?
Antonio: Not yet. We take the action to start making a plan.
EGEE issues coming from ROC reports
1. (NE - site RUG): a. I finally managed to solve the problem with missing accounting information. I found out that our site had not been added to the ACL of the central R-GMA repository. Apparently this ACL is not synchronised with the information in GOCDB. I think this is problematic. At the least, it took me a lot of work to find out that this was the problem.
Question from OCC to GOC: is the registration to the R-GMA registry for new sites a step to be documented in the ROCs' registration procedures?
b. Most of the work related to the issue above was caused by poor handling of the ticket. It was not possible for me to keep the problem in GGUS, which should have been possible. See the ticket for further details:
https://gus.fzk.de/ws/ticket_info.php?ticket=27104
Reply from GGUS: the current configuration of the GGUS system would indeed have allowed this particular ticket to be solved in a different way by the GGUS "standard" support units. Specifically, with reference to the action on the 8th-Oct, the RUC
A nice overview of the GGUS process, with particular stress on the interaction with the local helpdesks (Slides 4, 8), can be found in
http://indico.cern.ch/contributionDisplay.py?contribId=76&sessionId=24&confId=3580
Any suggestions to improve the process are of course welcome. Suggestions and recommendations can be posted in
https://savannah.cern.ch/support/?func=additem&group=esc
Pierre
(
Jeremy
(
2. (NE - site SARA) GGUS ticket 18826 was assigned to the SAM/SFT support team and confirmed on April 16, but nothing seems to happen; only from time to time the site is asked whether the problem still exists. See:
https://gus.fzk.de/pages/ticket_details.php?ticket=18826
Comment (SAM): Maarten Litmaath is working on this issue.
3. (CERN - site CERN-PROD): the command
glite-wms-job-status --all
meant to retrieve all jobs of the user that are in a certain status, is not working at CERN-PROD. As a follow-up to a GGUS ticket
(https://gus.fzk.de/ws/ticket_info.php?ticket=27455), the WMS service admins replied, opening a bug in
...
In Chapter 6.3.2 'Job Operations', the 'gLite 3 User Guide' explains that for the --all option to work, an index by owner must be created in the LB server; otherwise the command will fail, since the LB server will not be able to identify the user's jobs. Such an index can only be created by the LB server administrator, as explained in
http://server11.infn.it/workload-grid/docs/DataGrid-01-TEN-0118-1_2.pdf
However, it seems that this feature is not enabled by yaim. We don't want to do it by hand. If we allow users to perform this action, it should be configured automatically by yaim (for each user ?!?).
...
I really think that the --all option is nasty since it queries the whole
database, and it could slow down the performance of the LB node if several
users use it at the same time. It would be nice if the developers could make
some comments and decide what to do with the --all option.
Question from OCC to site admins: Has anyone ever experienced overload of the LB database due to concurrent use of the command glite-wms-job-status --all?
Question from OCC to HEP VOs: Does any of the HEP VOs plan to make extensive use of this feature?
ATLAS and CMS replied (offline) that they have no plans to use this feature.
4. (CE - IRB) ATLAS VO VOView problem. There are sites from CE listed as failing for which the reason is not clear, e.g. egee.irb.hr. Could the ATLAS VO provide the source code of the script which generates the list of failing sites? That would help us to understand what the problem with these sites is.
Reply (Simone Campana from ATLAS):
The code is in
/afs/cern.ch/user/c/campanas/public/VOVIEW
There is a .txt file with a query you should run on the BDII to gather the relevant info. In addition, there is a python script which fetches info from the output of the query and generates an output in which sites marked with ==> are the problematic ones.
I re-ran the query and published the results on
http://voatlas01.cern.ch/atlas/data/VOViewProblem.log
The site mentioned above is not in the list, so it looks fine. There are currently 65 queues with problems. It would be nice if some action could be taken; the situation has persisted for more than a month.
5. (CE - CYFRONET)
Comment (OCC): Question moved to CIC Portal. Point addressed later in this agenda.
6. (CE) Comment from CE sites about the expiration of an SL3 service after its SL4 version is released. SA3 proposed a period of one month, which seems reasonable provided there are no issues with the SL4 service that prevent migration. We suggest a procedure by which sites can raise issues that prevent them from migrating to the SL4 version of the service, and which extends the SL3 expiration date until the issues with SL4 are solved. The procedure could be as follows:
Comment (OCC): Suggestion forwarded to SA3. SA3 accepts the suggestion and in general agrees with it.
Marcin (CE): The main point is to give sites the possibility to raise issues with SL4. Details will come out later.
7. (UKI - RAL T1) There have been a few recent cases where the gstat page shows a site in maintenance while the GOCDB does not have any listed downtime (for one example see the RAL Tier-1 case in GGUS ticket 28520).
Reply (gstat): There is a problem with the maintenance module for gstat. We have disabled it for now so that we can fix it and put it back into production. This stems from a problem with the way we formulate the SQL query to GOCDB.
8. (UKI) Several sites have seen recent SAM sft-job failures relating to downloading from RBs. Errors like "Cannot download X from gsiftp://rb115.cern.ch", where X is usually .BrokerInfo or the tarball of SAM tests, are being seen. Is this evident in other EGEE regions and what is behind it?
Kostas: We will check among our recent failures.
9. (IT - CNAF) Error related to FTA_GLOBAL_DB_PASSWORD in site-info.def. (Very) possible reason: a bug in yaim, which does not change the password in /etc/tomcat5/Catalina/localhost/glite-data-transfer-fts.xml according to the variable FTA_GLOBAL_DB_PASSWORD.
Question (OCC): Version of FTS? Was a bug reported for that?
(IT - CNAF) answer: only reported to FTS developers. Will open one no later than tomorrow.
10. (IT - CNAF) SAM tests: we had a configuration problem with a CE (ce05-lcg.cr.cnaf.infn.it), some SAM tests failed, and FCR (Freedom of Choice for Resources, used by CMS) removed the CE from the BDII. Even though we fixed the CE quickly, it was considered down for many hours; it seems that the "JS" result needed too much time to be correctly published after all the single tests were successful. Because of this, the old JS ERROR status was still in effect, causing FCR to exclude our CE even though everything was OK. Is there a possible workaround for this? Has this already been raised by some other site/ROC?
Antonio: Please submit a GGUS ticket for that.
Additional points and clarifications about the new SD procedure
Gilles (CIC Portal): The new version of the CIC portal automates the notifications sent when a downtime is scheduled. The request is to provide information on why people are notified about an SD.
A discussion is ongoing with the GOCDB team about the definition of the core node level: when a site goes into scheduled/unscheduled downtime, the existence of core service nodes at the site should be checked and a notification should be sent.
If a service is an essential core service, then all ROCs should be notified; if it is essential only for a specific VO or ROC, only the specific people concerned should be notified.
We need to distinguish between core node (already exists in GOCDB) and core service (does not exist in GOCDB yet). The final aim is to automate the notifications as much as possible in order to reduce the risk of human error in selecting targets for the broadcasts.
Antonio: Sometimes a downtime could be of interest to particular ad hoc targets, so further discussion may be needed about whether some room for human intervention should be left in the system.
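The targeting rule discussed above can be illustrated with a small Python sketch. This is not the CIC portal or GOCDB implementation: the Node class, its fields, and the "ROC:"/"VO:" target labels are hypothetical names introduced only to show the rule (essential core service: notify all ROCs; otherwise notify only the VOs or ROCs concerned).

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical description of a node entering downtime."""
    hostname: str
    essential_for_all: bool = False                         # core service for everyone
    essential_for_vos: set = field(default_factory=set)     # e.g. {"atlas"}
    essential_for_rocs: set = field(default_factory=set)    # e.g. {"Italy"}


def notification_targets(nodes_in_downtime, all_rocs):
    """Return the sorted list of broadcast targets for a downtime."""
    targets = set()
    for node in nodes_in_downtime:
        if node.essential_for_all:
            # Essential core service: every ROC is notified.
            targets.update("ROC:" + roc for roc in all_rocs)
        # Services essential only to specific VOs/ROCs notify just those.
        targets.update("VO:" + vo for vo in node.essential_for_vos)
        targets.update("ROC:" + roc for roc in node.essential_for_rocs)
    return sorted(targets)
```

A node with no flag set produces no automatic target, which is where the room for human intervention raised by Antonio would come in.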
SAM: new security test in Validation
What the test does is:
-1. reads all the environment variables in the WN account where the job runs.
-2. for each directory/file specified in any variable, and for each file inside any of those directories:
-o the test checks whether the file/dir has write privileges for Other (the bit marked X in --------X-).
-o if the file/dir is in $PATH, it returns an ERROR
-o if the file/dir is not in $PATH, it returns a WARNING
-o if no 'w' privilege was found in any of the files/dirs, the test returns OK.
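The logic described above can be approximated by the following Python sketch. It is not the SAM test itself: the function names are hypothetical, and it assumes that "in $PATH" means either a directory listed in PATH or a file located directly inside such a directory.

```python
import os
import stat


def other_writable(path):
    """True if `path` exists and has the write bit set for 'other'."""
    try:
        return bool(os.stat(path).st_mode & stat.S_IWOTH)
    except OSError:
        return False


def candidate_paths(environ):
    """Yield each existing file/dir named in any variable, plus the
    entries directly inside each named directory."""
    seen = set()
    for value in environ.values():
        for item in value.split(os.pathsep):
            if not item or not os.path.isabs(item) or not os.path.exists(item):
                continue
            entries = [item]
            if os.path.isdir(item):
                try:
                    entries += [os.path.join(item, n) for n in os.listdir(item)]
                except OSError:
                    pass  # unreadable directory: only the dir itself is checked
            for entry in entries:
                if entry not in seen:
                    seen.add(entry)
                    yield entry


def check_environment(environ=None):
    """Return "ERROR", "WARNING" or "OK" following the rules above."""
    environ = os.environ if environ is None else environ
    path_dirs = {d for d in environ.get("PATH", "").split(os.pathsep) if d}
    result = "OK"
    for path in candidate_paths(environ):
        if not other_writable(path):
            continue
        # Assumption: "in $PATH" = a PATH directory or a file inside one.
        if path in path_dirs or os.path.dirname(path) in path_dirs:
            return "ERROR"
        result = "WARNING"
    return result
```

Calling check_environment() with no argument inspects the environment of the current process, which is the closest analogue to running inside the job on the WN.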
The Validation instance of the SAM portal is available here:
- https://lcg-sam-val.cern.ch:8443/sam-val/sam.py
Documentation is temporarily available in
- https://lxn1181.cern.ch:8443/sam-val/docs/CE-wn-sec-fp.html
In the near future it will be linked to the rest of the tests in production:
- http://grid.cyfronet.pl/sam-doc/masterindex.html
WLCG Items
WLCG issues coming from ROC reports
None this week.
Upcoming WLCG Service Interventions
1. [Announcement] FZK (Tier1 GridKa): Scheduled downtime for maintenance on November 6, 8:00-22:00 UTC (9:00-23:00 CET). Upgrade to dCache 1.8. All VOs using the GridKa SE are affected. Data transfers are stopped during this period.
FTS service review
See agenda for reports.
ATLAS service
Problem of VOView consistency. Progress registered on action 67. See point 4 in the EGEE ROC reports.
CMS service
No report. No representative present.
LHCb service
No report. No representative present.
No report. No representative present.
Service Challenge Coordination
OSG Items
Rob: A change in VOMS (announced in several EGEE broadcasts) caused OSG some amount of distress; basically, a communication flaw between the EGEE Broadcast tool and the OSG was discovered. We ask for goc@opensciencegrid.org to be added to the targets of the EGEE broadcasts.
Helene: No problem with that, but we first need to define how we are going to achieve it, in order to make sure that OSG actually gets the information it needs.
We need to identify OSG either as part of one of the current targets, as in the current form:
https://cic.gridops.org/index.php?section=roc&page=broadcast
-- ROC, Tier1.... included --
or as part of a new target -- project -- .
We cannot let people decide whether or not to include the OSG target in the broadcast; otherwise we are likely to end up in another communication hole.
Rob: While the correct placement of OSG is being defined, I'd nevertheless like the list to be added to all the communication flows. I could filter the messages for some time as a temporary measure.
Maria: As far as the VOMS/VOMRS services are concerned, we are in any case going to add goc@opensciencegrid.org to the list of our special targets (always in CC in our broadcasts).
Review of action items
The updated list of action items can be found attached to the agenda.
AOB
Graeme (ATLAS): At sites which have upgraded to SL(C)4 x86_64, ATLAS needs a 32-bit python to be available so that jobs can access the 32-bit (only) LFC plugin while bootstrapping. The only sensible way that has been found is to have, on 64-bit sites, the 32-bit version of python called "python32". Sites which have upgraded to SL(C) i386 should not be changed in this respect. The twiki page explaining how to do this is here: https://twiki.cern.ch/twiki/bin/view/Atlas/RPMcompatSLC4 (in particular https://twiki.cern.ch/twiki/bin/view/Atlas/RPMcompatSLC4#Running_ATLAS_production_at_Grid).
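As an illustration only, a job bootstrap could check the word size of the running interpreter and look for a "python32" executable along the lines of the Python sketch below. The helper names are hypothetical and this is not the ATLAS procedure documented on the twiki; only the executable name "python32" comes from the point above.

```python
import shutil
import struct


def interpreter_bits():
    """Pointer size of the running interpreter, in bits (32 or 64)."""
    return struct.calcsize("P") * 8


def find_32bit_python(names=("python32",)):
    """Return the full path of the first executable in `names` found
    on PATH, or None if none is installed."""
    for name in names:
        path = shutil.which(name)
        if path is not None:
            return path
    return None


def needs_python32():
    """True when the running interpreter is 64-bit, i.e. when a separate
    32-bit python would be needed to load a 32-bit-only plugin."""
    return interpreter_bits() == 64
```

On an i386 site needs_python32() would return False, which matches the remark that such sites need no change.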
Next Meeting
The next meeting will be Monday, 12th November 2007 14:00 UTC (16:00 Swiss local time).
Attendees can join from 13:45 UTC (15:45 Swiss local time) onwards.
The meeting will start promptly at 14:00 UTC.
The WLCG section will start at the fixed time of 16:30.