Weekly Operations Meeting Action List
Status as of 11th November 2007
Due Date colour key:
Red: action is overdue |
Yellow: action is due at or before next meeting |
White: action is due some time after the next meeting |
Grey: action closed at the last meeting |
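The colour key above amounts to a small decision rule. A minimal sketch of that rule follows, assuming each action has a concrete calendar due date; the function name, its parameters and the example dates are illustrative only and do not come from the action list itself. Due dates recorded as "ASAP" are not covered by this sketch.

```python
from datetime import date


def due_date_colour(due: date, today: date, next_meeting: date,
                    closed_at_last_meeting: bool = False) -> str:
    """Return the colour from the key for one action (hypothetical helper)."""
    # Grey: action closed at the last meeting.
    if closed_at_last_meeting:
        return "grey"
    # Red: action is overdue.
    if due < today:
        return "red"
    # Yellow: action is due at or before the next meeting.
    if due <= next_meeting:
        return "yellow"
    # White: action is due some time after the next meeting.
    return "white"


# Example dates are assumptions chosen for illustration.
status_day = date(2007, 11, 11)
next_meeting_day = date(2007, 11, 12)
print(due_date_colour(date(2007, 11, 5), status_day, next_meeting_day))
print(due_date_colour(date(2007, 11, 12), status_day, next_meeting_day))
print(due_date_colour(date(2007, 12, 3), status_day, next_meeting_day))
```

The closed check comes first so that a closed action is shown grey regardless of its due date; that ordering is a design assumption, since the key does not state a precedence.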
Open Action Items
# |
Raised by: |
Description |
Assigned to: |
Status |
Due date |
59 |
COD |
Andy to produce a list of possible node and site states within the GOCDB. This can be input to further discussion. Update 2007-08-16: Andy has sent the information. The conversation is ongoing. Update 2007-08-20: In progress. Update 2007-08-27: In progress. Update 2007-09-10: In progress. To be discussed at EGEE 07. Update 2007-09-10: Was not discussed at EGEE 07 due to lack of time. Will continue to discuss here. Update 2007-10-22: In progress. Update 2007-10-29: No one present to comment. Update 2007-11-05: No update received |
Andy Newton |
In progress |
ASAP |
64 |
|
Proposal to deal with informing UIs to reconfigure when a WMS/LB changes its node name. This needs to be taken to the next ROC Managers' meeting. Update 2007-10-22: The UI can now carry out service discovery. Steve Traylen is checking this. Need to check if the version of the WMS can be found through service discovery. Update 2007-10-29: Pending a GGUS ticket. https://gus.fzk.de/pages/ticket_details.php?ticket=28373 Update 2007-11-05: No update received |
Nick |
In progress |
ASAP |
67 |
ATLAS |
Track queues showing VOViews problems at the weekly operations meeting. Update 2007-10-22: There are now 90 queues still in question, which is fewer than two weeks ago. Carry on as is while the number continues to go down. Update 2007-11-05: Details on the tests provided by Simone Campana (available in this agenda). He reports: ‘There are currently 65 queues with problems. It would be nice if some action could be taken’ |
Ops coord / ATLAS |
In progress |
ASAP |
68 |
|
|
|
|
|
Closed Action Items
# |
Raised by: |
Description |
Assigned to: |
Due date |
Date closed |
71 |
|
ATLAS to check if they know of any conflicts between SL kernel version 2.6 and either the application software or the middleware. Progress on 2006-12-11: ATLAS was not present at the meeting; we’ll check offline. Progress on 2006-12-18: Alessandro DeSalvo says that, with respect to the ATLAS application, there is no problem. The only issue might arise (but not sure at all) from the Oracle client in the production system (which anyway has only 4 instances in the whole Grid) and the Data Management Clients in the VOBOXes. So as long as this discussion does not refer to VOBOXes, this is OK. The VOBOXes (only 10 nodes for ATLAS, one at each T1) will need to be considered some time soon. I will get in touch with Miguel for this. The action can be closed. |
Simone |
11/12/06 |
18/12/06 |
76 |
|
VOs to update their mailing lists of grid users so that grid operational messages are communicated to all users when necessary. Progress on 2007-01-08: The affected VOs have been contacted and the mailing lists updated. The action can be closed. |
OCC / VOs |
15/01/07 |
15/01/07 |
75 |
|
Provide DPM to ATLAS for testing purposes in PPS service. Progress on 2007-01-08: 2 DPM nodes are now installed at the CERN PPS site. In progress. Progress on 2007-01-15: This is now done. ATLAS will use lxb2058, which is a DPM at the CERN_PPS site. |
Nick |
15/01/07 |
15/01/07 |
3 |
|
Maite to forward e-mail regarding solutions to APEL problems at IFAE to the SWE ROC. Progress on 2007-01-18: This problem is now fixed. The action can be closed. |
Maite |
29/01/07 |
29/01/07 |
2 |
|
SA3 to set up a wiki page to give guidance on renewing host certificates for the different grid services. OCC to circulate the wiki page and ask for feedback on its suitability. Progress on 2007-01-15: The wiki pages can be found here: https://uimon.cern.ch/twiki/bin/view/LCG/TheLCGTroubleshootingGuide#How_to_replace_host_certificates Progress on 2007-01-29: There has been no negative feedback so this item will be closed. |
OCC |
29/01/07 |
29/01/07 |
4 |
|
Publish links to all SRM monitoring. Progress on 2007-01-18: The link to the SAM SRM monitoring is: The monitoring used by the FTS experts is a prototype and can be found here: http://pcitgm02.cern.ch:8081/ Progress on 2007-01-29: Completed. This item can be closed. |
OCC |
29/01/07 |
29/01/07 |
67 |
|
Timescale for move to Torque2? Progress on 2006-10-30: In progress. Progress on 2006-11-01: Expected to be in certification within 2-3 weeks. Progress on 2006-11-20: The counter was restarted last Friday, so it will go to PPS in 2-3 weeks from now. Estimated timeline: in PPS by ~15th Dec. Progress on 2006-12-18: It was released to PPS on Monday, and removed on Tuesday due to a critical problem found. Progress on 2007-01-15: This is now being certified at CERN and by one SA3 partner in Progress on 2007-01-29: Still in certification. Progress on 2007-02-12: Waiting for the certification report from the SA3 Greek partner. Progress on 2007-02-19: The patch containing this should be in the PPS next week. Progress on 2007-02-26: The patch is included in gLite 3.0 PPS-update 20, which is being deployed. The action can be closed. |
OCC |
29/01/07 |
26/02/07 |
1 |
|
Document VO expiration procedure and associated error message when it happens. Progress on 2007-01-15: This will be put into the GOC wiki. Progress on 2007-01-29: In progress. Progress on 2007-02-12: Maite will update it this week once she gets access to the GOC wiki Progress on 2007-02-19: In progress. Progress on 2007-02-26: Gocwiki updated. This action can be closed. |
OCC |
22/01/07 |
26/02/07 |
6 |
|
Item for COD agenda: Running the RB SAM tests on demand. Should this be a request to the test developers or should the CODs be in the OPS VO? Progress on 2007-02-12: Report about the conclusion once the COD minutes are published. Progress on 2007-02-19: The minutes of the COD meeting are not yet published. Progress on 2007-02-26: The minutes of the COD meeting are not yet published. Progress on 2007-03-05: This has been solved by finding a second RB to send SAM tests. The action can be closed. |
Helene |
05/03/07 |
05/03/07 |
7 |
|
Item for COD agenda: Filtering out of SAM site test failures due to the failure of a grid central service. Progress on 2007-02-12: Report about the conclusion once the COD minutes are published. Progress on 2007-02-19: The minutes of the COD meeting are not yet published. Progress on 2007-02-26: The minutes of the COD meeting are not yet published. Progress on 2007-03-05: A 3-point strategy was defined at the last COD meeting. They are looking for people to implement it: However, the timeline and details need to be refined by the SAM team, as much of the work is to be done in SAM and they are closest to it. The action can be closed. |
Helene |
26/02/07 |
05/03/07 |
9 |
|
Request from DM developers: they are testing SRM 2.2, and gfal and FTS need to know the version of SRM they are talking to. A test will be added to gstat by the end of this week. Progress on 2007-03-05: This should go in as a WARN this week, to be raised to ERROR in a couple of weeks. This action can be closed. |
Gstat team |
12/03/07 |
05/03/07 |
10 |
|
Escalate ticket 18279, WMS condorc-luncher files filling /tmp, raised by the DECH ROC, to the EMT. Progress on 2007-03-05: Tmpwatch can be configured to clean those files up more often, even once a day, if needed. https://savannah.cern.ch/patch/?1062 The configuration is being worked on now, so it will be made available in around a month. The action can be closed. |
OCC |
12/03/07 |
12/03/07 |
11 |
|
Point for next meeting agenda: information about WMS – RB deployment strategy Progress on 2007-03-05: Ian Bird will attend next meeting to give a status update on this. The action can be closed. |
OCC |
12/03/07 |
12/03/07 |
70 |
|
Conclude on "Policy for security updates of third party software". The gLite integration team policy is: the external packages are not guaranteed to be maintained. They are provided for convenience. They are maintained by their providers. The reality is that they will be maintained on best effort. To be clarified with the security team. Progress on 2006-11-27: Being discussed with OSCT and SA3. Progress on 2006-12-04: This item was discussed during the meeting. Waiting for SA3 to create the final list of external packages which need to be maintained. Progress on 2007-01-15: Waiting for feedback from SA3. Progress on 2007-01-29: Waiting for feedback from SA3. There is a proposal to focus on removing all unnecessary dependencies within the gLite code which will probably impact this item, so the deadline will be extended. Progress on 2007-02-12: From Oliver: The progress to date is here: https://twiki.cern.ch/twiki/bin/view/EGEE/SourceTarballs In other words, some sorting of the externals; identification of what is maintained or out-of-date, what can be reclassified etc. There is a plan being drawn up for a big effort on reducing the external dependencies. If I remember correctly, this plan is due next week. I think it's best to wait for that. The work done so far will be fed into this plan. Progress on 2007-02-19: No update. Progress on 2007-03-05: (Update offline after the meeting) A gLite restructuring plan has been worked out by the integration, middleware and operations teams to make a radical examination of the code base with a view to removing unnecessary dependencies and cleaning up sections of the code that cause build and porting difficulties. The following tasks have to be well advanced before the execution of the plan can be started. The EGEE PMB monitors the progress every second week.
- Move to SL4 on worker nodes and user interface
- Move to ETICS build infrastructure
- Stabilization and scalability of WMS and LB
- Stabilization and scalability of the gLite-CE
Even if the plan execution has not started, the developers have already started cleaning the dependencies while porting to the new build system. I would propose to close this action as it is being tracked somewhere else, and come back with information to the operations meeting once it is ongoing. |
OCC |
15/01/07 11/02/07 |
12/03/07 |
8 |
|
From COD report: RB-time_to_match test must be improved. Many alarms linked with this test were useless. To be fixed by UKI; Steve will open a GGUS ticket and assign it to them. Progress on 2007-03-05: Ticket 19319 opened https://gus.fzk.de/ws/ticket_info.php?ticket=19319&from=ID Progress on 2007-03-12: In progress. Progress on 2007-03-19: This is now fixed and the ticket can be closed. |
UKI ROC |
12/03/07 |
19/03/07 |
15 |
|
OCC to chase up a solution for ticket 19464 (https://gus.fzk.de/pages/ticket_details.php?ticket=19464) Progress on 2007-03-19: This ticket is now being handled by the gLite integration team. |
OCC |
26/03/07 |
19/03/07 |
16 |
|
LHCb to produce a list of services that will be affected due to the new SRM v1 endpoint at INFN-T1. Also, a coordinator for the intervention must be found. Progress on 2007-03-19: The list of services has been provided. The intervention coordinator is Marianne Bargiotti. |
LHCb |
26/03/07 |
19/03/07 |
17 |
|
LHCb wish to use dccp across the WAN to do some pre-staging. LHCb to provide the relevant tier 1 sites with the details of what is required (port numbers, port types, etc.) Progress on 2007-03-19: LHCb has directly contacted the sites involved (RAL, GridKA and IN2P3). RAL is in the process of opening the relevant ports. GridKA and IN2P3 have yet to confirm their position. |
LHCb |
26/03/07 |
19/03/07 |
15 |
- |
SA3 need to know when the larger sites plan to move to SL4 (and binary compatible) machines. This will allow better planning of the move of the middleware from SL3 to SL4. This information should be sent to Maite.Barroso.Lopez@cern.ch and Nicholas.Thackray@cern.ch Progress on 2007-03-26: The following ROCs/sites have supplied information: SW Europe ROC/PIC, DECH/FZK, SARA, Progress on 2007-04-02: In progress. Progress on 2007-04-16: It can be closed; all major sites have given feedback, collected on the following wiki: https://twiki.cern.ch/twiki/bin/view/EGEE/Sites_Plans_to_go_to_SLC4 |
All EGEE ROCs |
16/04/07 |
26/03/07 |
14 |
- |
Site HPC2N have the SE ibelieve-i.hpc2n.umu.se that they want to take out of production. Simone to look at the data on the SE and decide if it needs to be moved or can be thrown away. Progress on 2007-03-19: The site has provided ATLAS with a list of files which exist on the SE. ATLAS will go through this and decide what to do with the files. Per requests that this be carried out as quickly as possible. Progress on 2007-03-26: No progress due to ATLAS’s problems with CASTOR and the ATLAS week in
Progress on 2007-04-02: No-one at the meeting knew of any update. Per will follow it up. Progress on 2007-04-16: Nobody from ATLAS online; chase offline. Progress on 2007-04-23: As there was no response from ATLAS, the files were removed. Contact the site admin if more info is required: ake.sandgren@hpc2n.umu.se This action can be closed. |
Simone C. / Per Oster |
23/04/07 |
26/03/07 |
13 |
- |
Some sites are accidentally publishing a local LFC as a global LFC. How might this be prevented? Progress on 2007-03-19: Steve spoke to Judit who thinks that a solution can be implemented in the FCR tool. She will check. Progress on 2007-03-26: No update at the meeting. Progress on 2007-04-02: Judit has analysed the feasibility of providing this functionality using the FCR and it can be done. It will go into the work plan. |
Judit Novak |
2/04/07 |
2/04/07 |
18 |
- |
A middleware tool for carrying out the bulk removal of files from SEs, appropriately updating the catalogues, etc. is needed. This will be put onto the List of Issues maintained by the ROCs and regularly presented to the TCG. Progress on 2007-04-02: This is now on the ROC Top Issues list, item 23 (https://twiki.cern.ch/twiki/bin/view/EGEE/SA1_TCG) |
OCC |
02/04/07 |
02/04/07 |
19 |
- |
Request for a mechanism so top-level BDIIs can publish themselves. Progress on 2007-03-26: Bug opened by Steve Traylen, to be raised at the EMT: https://savannah.cern.ch/bugs/?25033 Progress on 2007-04-02: This work has been scheduled to be done immediately after the release of gLite 3.1. |
OCC |
02/04/07 |
02/04/07 |
21 |
- |
Development/testing/certification/PPS status of MySQL LFC. Progress on 2007-03-26: PPS: several local instances, plus the global LFC at PIC, run MySQL. Certification: Both flavours of LFC are tested and have permanent installations on the testbed. To be closed. |
OCC |
02/04/07 |
02/04/07 |
22 |
- |
Estimation on when SLC4 WNs will be available in production. Progress on 2007-04-02: The SL4 natively compiled WN is now in the PPS. It will be tested by the HEP VOs and when they are happy with it, the WN will be passed to production. Action to be closed. |
OCC |
02/04/07 |
02/04/07 |
23 |
- |
Estimation on when the unified version of RFIO client for DPM and CASTOR will be in production. Progress on 2007-03-26: Update from the developers: no manpower, this is not expected this year. |
OCC |
02/04/07 |
02/04/07 |
29 |
CERN/ Triumf/ |
Escalate the problems being seen with the job wrapper tests to the ROC managers’ meeting. Progress on 2007-04-16: This was done; Piotr presented it at today’s meeting (see minutes). The action can be closed. |
Nick |
16/04/07 |
3/04/07 |
24 |
ROC DECH |
Ask the R-GMA development team if they can attend the next grid operations meeting to answer questions on instabilities seen in the R-GMA system. Progress on 2007-04-16: Done, the action can be closed |
OCC |
16/04/07 |
16/04/07 |
25 |
All |
Ask SA3 to give an update at the next grid operations meeting on the status of the port to SL4 and also the relative priorities of the different middleware services. Progress on 2007-04-16: Done; this can be closed. |
OCC |
16/04/07 |
16/04/07 |
28 |
OSG |
Check that Laurence Field is the correct person for OSG to contact regarding their problems with the BDII. Progress on 2007-04-16: Yes, it is Laurence; please contact him. The action can be closed. |
OCC |
16/04/07 |
16/04/07 |
26 |
ROC SEE |
Check up on the status of the following tickets: https://gus.fzk.de/pages/ticket_details.php?ticket=18689 Progress on 2007-04-16: 18353: not solved but some activity, ongoing. 18698: no progress for a long time, no answer. To be raised again. Progress on 2007-04-23: These tickets were raised to the EMT on 25/04/07. Some action will follow and will be reflected in the tickets. Progress on 2007-05-07: These tickets are now being handled. |
OCC |
07/05/07 |
16/04/07 |
34 |
SEE ROC |
gLite 21 update release notes stated that reconfiguration is needed just for lcg-CE, lcg-CE_torque, and glite-CE, but in fact you need to introduce new accounts on all WNs at the same time. This is the list of GGUS tickets we have created so far: Progress on 2007-05-07: 2 tickets are now solved and one unsolved ( |
OCC - YAIM |
07/05/07 |
07/05/07 |
31 |
COD ( |
Solve first issue in the list of COD notes for today’s meeting: http://egee-docs.web.cern.ch/egee-docs/operational_tools/Operations_Meetings/2007/Weekly_Operations_Meeting_minutes_2007-04-23.htm Error while calling the "NSClient::multi" native api IOException: Unable to connect to remote (lapp-rb01.in2p3.fr:7772) Progress on 2007-05-07: This is not an issue for the PPS to solve, but for the site to solve. The site should ask for help from their ROC and if the ROC cannot help then a ticket should be raised. Close. |
PPS coordination |
07/05/07 |
07/05/07 |
20 |
- |
Search for or request documentation about how a VOMS proxy is mapped on a grid node (CE, SE, etc.) using LCMAPS. Maria Dimou provided pointers to docs from 2005. Waiting for input from Maarten and David Groep. Progress on 2007-04-02: In progress. Progress on 2007-04-16: The available information was collected and sent to Progress on 2007-04-23: Progress on 2007-05-07: A GGUS ticket was submitted (GGUS #21349) to ask developers for a complete LCMAPS administrator guide. The ticket has been raised to the EMT. Progress on 2007-05-14: Comment from the EMT: we will work on it as soon as possible. This action can be closed. |
OCC |
14/05/07 |
21/05/07 |
27 |
OCC |
Report on the outcome of the tests regarding grid site status information (taken from the information system, the WMS, the batch systems, etc.). Progress on 2007-04-16: Patricia reported on it during the meeting. We’ll leave it open till we have the outcome. Progress on 2007-04-23: Data from grid site tests were made available in MonAlisa. Progress on 2007-05-07: In progress. Progress on 2007-05-14: Patricia reported about this; see her report in the |
|
14/05/07 |
21/05/07 |
30 |
ROC SEE |
Create a wiki page to collect information about deployment of SL4 gLite services, even with workarounds: http://wiki.egee-see.org/index.php/SL4_WN Progress on 2007-04-23: Not done yet. Does the SEE ROC volunteer to do it under the main SA1 wiki or at the GOC wiki? Kostas: no manpower. Volunteers are welcome. Progress on 2007-05-14: There are no volunteers and not many people are interested, so ROC SEE agrees to close the action. |
OCC |
23/04/07 |
21/05/07 |
32 |
CERN ROC, TRIUMF |
SAM still handles timezones incorrectly. Maintenance on Fri 20th scheduled for 14:00 - 16:00 UTC but SAM shows maintenance incorrectly at 08:04 UTC and in error at 14:02 UTC, i.e. wrongly during our maintenance. Judit: the timestamp is taken by SAM directly from the GOCDB as it is registered. GOCDB seems not to convert time zones correctly. The GOCDB people will look into the problem together with Judit, who will send all the details already found. Progress on 2007-05-07: This now looks like it might be a SAM issue. Still under investigation. Progress on 2007-05-14: Judit is following up. Progress on 2007-05-21: Bug opened in |
SAM, GOCDB |
04/06/07 |
21/05/07 |
33 |
CERN ROC, FNAL |
We set up a 2nd lcg gateway for redundancy. But if either goes down, SAM flags us as being down, thereby defeating the purpose of the 2nd gateway. Of course we are still operational, only SAM is marking us incorrectly. How can this be improved? I was told CERN runs multiple gateways, how do they handle this? 2. We need to split the cmswnNNN accounts on the 2 gateways since they operate independently. Judit: SAM considers the gateways separately as two computing elements and they are both monitored. The issue has to be followed off-line between SAM support and FNAL. Progress on 2007-05-14: Judit is following up. Progress on 2007-05-21: SAM support did not find the problem that was reported and has explained to FNAL that they are only seeing the expected behaviour of SAM; any additional comment from the originators, or can we close this action? Update from FNAL: FNAL is checking again if they appear as completely down when only one CE is down. Test being done now, more news next week. Progress on 2007-04-06: According to Joe Kaiser and Judit Novak, the issue is solved. |
SAM |
14/05/07 4/06/07 |
21/05/07 |
37 |
ROC |
YAIM doesn’t support multiple clusters/sub-clusters per CE. A bug needs to be submitted for this. Progress on 2007-04-06: https://savannah.cern.ch/bugs/?26757 To be closed. |
Steve Traylen |
04/06/07 |
21/05/07 |
35 |
UKI ROC |
Technical issues to do with the email that CIC-Portal Alarms send: Progress on 2007-05-21: Both points have been addressed. Close this item. |
CIC team |
04/06/07 |
04/06/07 |
38 |
ROC UKI |
UKI-SCOTGRID-GLASGOW had to clear jobs which were stalled due to lcg-cr commands hanging (http://scotgrid.blogspot.com/2007/05/users-and-stalled-jobs.html). No response from the biomed user who was responsible for most of these (https://gus.fzk.de/pages/ticket_details.php?ticket=22717). User will be banned from our site if no response is forthcoming. We believe this is a reasonable policy for our site, but are there official guidelines on this? Steve has contacted Romain to see if there is anything existing on this policy area. Romain forwarded a proposal to the OSCT; it is under discussion. Progress on 2007-06-25: Mail sent to Romain to check progress. Progress on 2007-07-02: The ticket has been solved and closed. It seemed to be a problem on the remote SE, which was uncovering a low-level bug in the gridftp v1 code. See also https://gus.fzk.de/ws/ticket_info.php?ticket=22634 |
Steve/Romain |
02/07/07 |
18/06/07 |
39 |
DECH ROC |
ATLAS to supply a list of sites at which the FQAN VOviews tags should be removed. Progress on 2007-06-25: Done, Simone sent this last week. It is now linked to the agenda page. |
Simone Campana / ATLAS |
25/06/07 |
25/06/07 |
41 |
OCC |
FTS development team and integration release team to remove FTS 2.0 from production repository. Progress on 2007-07-02: Gavin distributed a proposal on 3/07/07 to the T1s about how to proceed: "To avoid prematurely upgrading to FTS 2.0 from the production repository the gLite integration team suggest disabling or removing the gLite apt or yum config files on the FTS server nodes, since going back to previous gLite release is not trivial with the current system." If we do the staged-release thing again, we will keep the 'new' version in the PPS repository until it is time for wider deployment. FYI. The patch for the new version of FTS 2.0 has entered gLite integration (with a variety of bug fixes, as found on the FTS pilot service and on CERN-PROD). Progress on 2007-07-09: Check with Gavin. Progress on 2007-07-16: No objection from T1s. The action can be closed. |
FTS team |
16/07/07 |
02/07/07 |
45 |
ROC |
Accounting portal isn’t showing data for the last 3 months. Progress on 2007-07-09: It works now; it can be closed. |
ROC SWE |
9/07/07 |
9/07/07 |
44 |
ROC |
Schedule the updating of the information system in the production service (top-level BDII, site BDII, etc.) Update 3/07/07: The broadcast was sent today. This item can be closed. |
OCC |
3/07/07 |
9/07/07 |
43 |
ROC |
Submit a change request to the GOCDB to have a flag which shows whether a service is considered as being in production. Progress on 2007-07-09: Request submitted and accepted by the GOCDB team. This action can be closed. |
ROC |
9/07/07 |
9/07/07 |
40 |
OCC |
Publish dates and features of the next gridview release. Progress on 2007-07-02: After the discussions between the gridview team and WLCG a decision was made that the sites have to confirm if the new availability calculation is acceptable for them. This means that the algorithm will have to be presented to them (not yet agreed exactly when and how), and a release can come only following that. So for the time being Gridview can present neither the new features coming with the next release (as they might need to be changed), nor the approximate time of the release (unsure yet). |
SAM/Gridview |
02/07/07 |
09/07/07 |
42 |
PPS |
Decide whether the severity of the broker-info test needs to be temporarily downgraded from critical. Update 3/07/07: The problem was found to be that the LINUX “which” command was not supported at the site. It is expected that this command should be available at all sites and so the SAM test does not need to be changed. This item can be closed. |
OCC |
02/07/07 |
9/07/07 |
46 |
ROC |
Russian COD team to send to ROC managers the list of ~10 CEs unregistered in GOCDB and monitored by SAM (because they appear in the production top-level BDIIs). Progress on 2007-07-16: This can no longer be done as the data is transitory and can’t be recreated. COD will raise tickets when they come across similar sites. Close the item. |
ROC |
16/07/07 |
16/07/07 |
48 |
ROC |
Make public the official URLs to be used for a TOP BDII configuration. Progress on 2007-07-16: Top URL for BDII configurations has been put on the CERN-ROC page. Close the item. |
ROC CERN |
16/07/07 |
16/07/07 |
47 |
ROC |
In the weekly ROC report, we can fill in the T1 availability text field, but it's impossible to consult it afterwards. Please update the consultation view of the ROC report to make this field available. Progress on 2007-07-23: Field is now available to be consulted. A few improvements requested by Maite and Alberto Aimar to Osman are being implemented. The action can be closed. |
CIC team |
16/07/07 |
30/07/07 |
49 |
OCC |
Tier 1 sites to include status updates of their migration to the SL4 WN in their weekly site reports. Progress on 2007-07-23: Some sites/ROCs already reported for today’s meeting. See agenda/minutes. We’ll keep it as a recurrent point in the coming weeks’ agendas, till the end of August, so the action can be closed. |
All tier 1 sites |
31/08/07 |
30/07/07 |
50 |
UKI ROC |
All ROCs to check with sites whether one week’s notice is enough to change firewall rules as a consequence of the change of HW and IP address for the R-GMA registry at RAL. The change will take place in September. Progress on 2007-07-30: There are no objections, so RAL can proceed with the plan. They will send a broadcast and also give notification at the operations meeting one week in advance of their intervention, starting from September. |
All ROCs |
06/08/07 |
30/07/07 |
34 |
SWE ROC |
What is the status of the VO configuration tool in the CIC portal? Is it available? Progress on 2007-05-14: Work in progress; we’ll report in the coming weeks. Progress on 2007-04-06: Update requested by mail to the CIC portal team: The Oracle migration is complete. So are the small modifications of the way VO registrations are handled in the YAIM VO Configurator tool (the use of the CIC VO table and the difference in the way "official" and "non-official" registrations are handled). What is needed is just some more tests before I commit the code in the IN2P3 CVS repository. Progress on 2007-06-25: 1st version being tested by the CIC team. It will be put in production with an intermediate step: web service at CERN and DB at IN2P3; this does not disturb normal operation. Full integration later during the summer. Progress on 2007-07-02: YAIM VO Configurator is available at the following URL: Progress on 2007-07-09: Summarize status after demo with Dimitir. Next YAIM version will still have the VO examples, as the YAIM tool will not be ready. Progress on 2007-07-16: Work to be done at the CIC portal is waiting for Gilles to be back from vacation; will resume at the beginning of August. Wait till then. Progress on 2007-08-06: In progress. Completion expected by the end of August. See the minutes of this meeting. Close. |
Hélène Cordier |
13/08/07 |
06/08/07 |
52 |
Several sites |
Request CMS and LHCb to state at the next meeting if they have any requirements for non-tier-1 sites moving to SL4/gLite 3.1. Update 2007-08-08: CMS gave the following statement: All CMS sites should move to SL4 as soon as possible. CMSSW_1_5 and successive will only be built for SL4. (The source of the statement is Stefano Belforte) LHCb gave the following statement: LHCb are not pushing sites to migrate to SL4. Until recently, LHCb faced some issues running on SL4 WNs because of the previously reported lcg-cp problem which took time to understand. LHCb are now splitting the CEs into SL3 and SL4 such that when issues arise, it is clear as to which version of the OS they are related to. Until the workaround for the lcg-cp problem is fixed in DIRAC (most likely next week) LHCb are preventing their jobs going to SLC4 sites. After that, just like ATLAS and (The source of the statement is Philippe Charpentier) Update 2007-08-13: Responses have now been received from all experiments and so this item can be closed. |
CMS / LHCb |
20/08/07 |
13/08/07 |
53 |
|
Announce to all VOs the decision of the operations meeting that by default sites will configure the VOs they support with pool accounts for SGM and PRD. If a VO does not want this then they should make this clear on their VO ID card. Update 2007-08-08: Announcement sent to the VO managers and CC to Fred Schaer (Coordinator of the EGEE VO managers). Update 2007-08-13: This item can be closed. |
OCC |
20/08/07 |
13/08/07 |
56 |
|
When will gLite be ported to ia64 (Itanium)? What is the proposed support plan for ia64? Update 2007-08-08: This will be discussed at the next operations meeting (13th August). Update 2007-08-13: Discussed during the meeting. Close. |
SA3 |
20/08/07 |
13/08/07 |
57 |
PIC |
CMS requested to transfer files from PIC to FNAL using SRM copy mode instead of the default URL copy. They are asking for specific configuration with dedicated channels. We would like some clarification on this subject. Update 2007-08-08: (From Gavin) The FTS setup for CMS is still under discussion. Gavin will follow this up. Close? Update 2007-08-13: Close. |
Gavin/Steve |
20/08/07 |
13/08/07 |
61 |
COD |
SAM to suppress alarms when a site is in downtime. Update 2007-08-16: SAM should already suppress alarms for sites/nodes that are in scheduled downtime. If this is seen again, please raise a GGUS ticket. Close. |
SAM |
ASAP |
20/08/07 |
62 |
BNL |
The plots of the hourly report for VO "OPS" and the Tier 1 site at BNL do not seem right. It is strange that all individual services are green and the overall service showed a different result. How is the overall service status generated? Can this problem be fixed? Update 2007-08-16: There wasn’t a problem. The site BDII test was failing, hence not all services were green. Close. |
SAM |
ASAP |
20/08/07 |
51 |
SA3 |
Would there be any problems with using the externally maintained “DAG” repository for external dependencies of the middleware? Update 2007-08-08: All ROCs asked to contact their sites for feedback. No response by 20 August will be taken as agreement with the proposal to use the DAG repository. Update 2007-08-16: Replies to the proposition can be found here: Update 2007-08-20: Discussed during the meeting. See the minutes. Close. |
EGEE ROCs / sites |
20/08/07 |
20/08/07 |
54 |
WLCG |
All tier-1 sites to report at each meeting until the end of August on their status and plans regarding moving to SL4 / gLite 3.1 WNs. Action expired. Close. |
Tier-1 sites |
27/08/07 |
27/08/07 |
55 |
Several ROCs |
Find out what the plans are to test SRM 2.2 in the production service. Update 2007-08-08: This topic will be discussed at the LCG Management Board this week so there should be some feedback for the grid operations meeting on 13th August. Update 2007-08-13: Last week’s MB was cancelled. This item will be discussed tomorrow. Update 2007-08-20: Discussed during the meeting. See the minutes. Close. |
OCC |
20/08/07 |
20/08/07 |
58 |
PIC |
PIC has deployed SL4 WNs. To check if these WNs are OK for LHCb they have been using the LHCb-specific SAM tests and everything seems OK; however, the LHCb software does not seem to be properly installed on our WNs. LHCb to comment on this. Update 2007-08-13: LHCb not at the meeting so no update. Update 2007-08-20: e-mail sent to PIC and LHCb. This will be followed
up off-line. Close. |
LHCb |
As soon as possible |
20/08/07 |
60 |
COD |
PPS to change how the SAM monitoring is carried out so that the certificate used for submitting the jobs is associated with only one VO. Update 2007-08-20: Done (see PPS section of the minutes). Close. |
PPS |
20/08/07 |
20/08/07 |
63 |
OCC |
SAM team to present first draft of process for user testing and announcement of updates. Update 2007-08-27: Done. Close. |
SAM |
27/08/07 |
27/08/07 |
12 |
- |
Clarify the implications for a site of an SE downtime, since tests on the CE also fail because they need a default SE. Should the site be put in complete downtime? (Note from the minute taker: the default SE could be changed to one at some other site during the SE's downtime.) How does this affect the VOs? Progress on 2007-03-26: In progress. Maite is coordinating. Progress on 2007-04-16: No news, to follow up. Progress on 2007-04-23: Proposal to follow the SAM site availability calculation rules:
Which would mean:
- for sites with only one SE, if the SE is down, the whole site will be considered as down, so schedule site downtime
- for sites with more than one SE, if one or more (but not all) will be down, schedule downtime only for the affected SE(s)
- for sites with more than one SE, if all will be down, schedule site downtime
Summarizing: If a site has no SEs available, it should be declared as down. Progress on 2007-05-07: There is a long-running e-mail discussion on this. Progress on 2007-05-07: no update. Progress on 2007-05-21: In progress. Progress on 2007-06-25: This issue has been raised again at today’s meeting. I’ll put all available information in a wiki and try to come up with a proposal. Progress on 2007-07-09: In progress. Progress on 2007-07-16: Progress information available at https://savannah.cern.ch/task/?5222 Progress on 2007-08-06: Still in progress
but an outcome is expected soon. See the above Progress on 2007-08-13: Still in progress. Progress on 2007-08-20: Still in progress. Progress on 2007-08-27: Still in progress. Progress on 2007-09-10: Still in progress. Progress on 2007-10-08: Still in
progress. This is now tracked in the SA1 Technical Issues project in |
OCC |
As soon as possible |
15/10/07 |
36 |
SWE ROC |
Items for COD meeting agenda in Progress on 2007-04-06: It is now included in COD’s agenda. Wait for report and conclusion Progress on 2007-06-18: This was
discussed at the COD meeting and operations workshop in Progress on 2007-06-25: Mail sent to PPS to inquire about the conclusion. Progress on 2007-07-02: No progress, Hélène and Nick to discuss. Progress on 2007-07-09: In progress. Progress on 2007-07-30: No progress due to people on vacation. Will resume when Hélène returns. Progress on 2007-08-20: This is being discussed by the PPS site administrators. When an agreement has been reached there, it will be discussed with the CODs. Progress on 2007-08-20: In progress. Progress on 2007-09-17: In progress. To be discussed at EGEE 07. Progress on 2007-09-17: This was
discussed and agreed at EGEE 07. In brief, the CODs will no longer raise
tickets for the PPS but instead a weekly report will be generated which can
be used to monitor how well sites are monitoring and fixing themselves. All
PPS sites will sign up to the RSS feed from the CIC Portal. |
OCC |
04/06/07 As soon as possible |
15/10/07 |