Weekly Operations Meeting Action List
Status as of 21st May 2007
Due Date colour key:
Red: action is overdue |
Yellow: action is due at or before next meeting |
White: action is due some time after the next meeting |
Grey: action closed at the last meeting. |
Open Action Items
# |
Raised by: |
Description |
Assigned to: |
Status |
Due date |
12 |
- |
Clarify the implications for a site of SE downtime: tests on the CE fail as well, since they need a default SE. Should the site be put in complete downtime? (Note from the minute taker: the default SE could be changed to an SE at some other site during the SE's downtime.) How does this affect the VOs?
Progress on 2007-03-26: In progress. Maite is coordinating.
Progress on 2007-04-16: No news, to follow up. Proposal to follow the SAM site availability calculation rules, which would mean:
- for sites with only one SE, if the SE is down, the whole site is considered down, so schedule a site downtime
- for sites with more than one SE, if only some (but not all) of the SEs will be down, schedule downtime only for those SEs
- for sites with more than one SE, if all of them will be down, schedule a site downtime
Summarizing: if a site has no SEs available, it should be declared as down.
Progress on 2007-05-07: There is a long-running e-mail discussion on this.
Progress on 2007-05-07: No update.
Progress on 2007-05-21: In progress. |
COD + SAM |
In progress |
19/03/07 |
33 |
CERN ROC, FNAL |
1. We set up a 2nd lcg gateway for redundancy. But if either goes down, SAM flags us as being down, thereby defeating the purpose of the 2nd gateway. Of course we are still operational; only SAM is marking us incorrectly. How can this be improved? I was told CERN runs multiple gateways; how do they handle this?
2. We need to split the cmswnNNN accounts on the 2 gateways since they operate independently.
Judit: SAM considers the gateways separately as two computing elements and they are both monitored. The issue has to be followed off-line between SAM support and FNAL.
Progress on 2007-05-14: Judit is following up.
Progress on 2007-05-21: SAM support did not find the problem FNAL reported, and has explained to them that they only see the expected behavior of SAM. Any additional comment from the originators, or can we close this action? Update from FNAL: FNAL is checking again whether they appear as completely down when only one CE is down. Test being done now, more news next week.
Progress on 2007-04-06: According to Joe Kaiser and Judit Novak, the issue is solved. |
SAM |
To be closed |
4/06/07 |
34 |
SWE ROC |
What is the status of the VO configuration tool in the CIC portal? Is it available?
Progress on 2007-05-14: Work in progress; we’ll report in the coming weeks.
Progress on 2007-04-06: Update requested by mail to the CIC portal team: The Oracle migration is complete. So are the small modifications to the way VO registrations are handled in the YAIM VO Configurator tool (the use of the CIC VO table and the difference in the way "official" and "non-official" registrations are handled). All that is needed is some more tests before I commit the code to the IN2P3 CVS repository. |
Hélène Cordier |
In progress |
11/06/07 |
36 |
SWE ROC |
Items for COD meeting agenda in Stockholm: How to deal with failing PPS sites.
Progress on 2007-04-06: It is now included in COD’s agenda. Wait for report and conclusion. |
OCC |
New |
04/06/07 |
37 |
ROC France |
YAIM doesn’t support multiple clusters/sub-clusters per CE. A bug needs to be submitted for this.
Progress on 2007-04-06: https://savannah.cern.ch/bugs/?26757 To be closed. |
Steve Traylen |
To be closed |
04/06/07 |
38 |
ROC UKI |
UKI-SCOTGRID-GLASGOW had to clear jobs which were stalled due to lcg-cr commands hanging (http://scotgrid.blogspot.com/2007/05/users-and-stalled-jobs.html). No response from biomed user who was responsible for most of these (https://gus.fzk.de/pages/ticket_details.php?ticket=22717). User will be banned from our site if no response is forthcoming. We believe this is a reasonable policy for our site, but are there official guidelines on this? Steve has contacted Romain to see if there is anything existing on this policy area. Romain forwarded a proposal to the OSCT, under discussion. |
Steve/Romain |
New |
18/06/07 |
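The downtime proposal recorded under action 12 can be sketched in a few lines of Python. This is only an illustration of the three rules as listed in the minutes, not SAM or GOCDB code: the function name `downtime_scope`, the `"SITE"` marker and the hostnames are hypothetical, and raising an error for a site with no SEs registered at all is an assumption, since the minutes do not cover that case.

```python
from typing import Dict, List


def downtime_scope(se_going_down: Dict[str, bool]) -> List[str]:
    """Return what to schedule downtime for: ["SITE"] or a list of SE names.

    se_going_down maps each SE hostname at the site to True if that SE
    will be down during the intervention (hypothetical input format).
    """
    if not se_going_down:
        # Assumption: a site with no SEs registered is outside the proposal.
        raise ValueError("site has no SEs registered")
    down = [name for name, is_down in se_going_down.items() if is_down]
    if len(down) == len(se_going_down):
        # No SE remains available (this covers the single-SE case):
        # the whole site is declared down.
        return ["SITE"]
    # At least one SE remains available: schedule downtime only for
    # the affected SEs (an empty list means nothing to schedule).
    return sorted(down)
```

For example, `downtime_scope({"se1.example.org": True, "se2.example.org": False})` returns only the affected SE, while marking both SEs as down returns `["SITE"]`.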
Closed Action Items
# |
Raised by: |
Description |
Assigned To: |
Due date |
Date closed |
71 |
|
ATLAS to check if they know of any conflicts between SL kernel version 2.6 and either the application software or the middleware.
Progress on 2006-12-11: ATLAS was not present at the meeting; we’ll check offline.
Progress on 2006-12-18: Alessandro DeSalvo says that in respect of the ATLAS application there is no problem. The only issue might arise (but not sure at all) from the Oracle client in the production system (which anyway has only 4 instances in the whole Grid) and the Data Management Clients in the VOBOXes. So as long as this discussion does not refer to VOBOXes, this is OK. The VOBOXes (only 10 nodes for ATLAS, one at each T1) will need to be considered some time soon. I will get in touch with Miguel for this. The action can be closed. |
Simone |
11/12/06 |
18/12/06 |
76 |
|
VOs to update their mailing lists of grid users so that grid operational messages are communicated to all users when necessary. Progress on 2007-01-08: The affected VOs have been contacted and the mailing lists updated. The action can be closed. |
OCC / VOs |
15/01/07 |
15/01/07 |
75 |
|
Provide DPM to ATLAS for testing purposes in PPS service.
Progress on 2007-01-08: 2 DPM nodes are now installed at the CERN PPS site. In progress.
Progress on 2007-01-15: This is now done. ATLAS will use lxb2058 which is a DPM at the CERN_PPS site. |
Nick |
15/01/07 |
15/01/07 |
3 |
|
Maite to forward e-mail regarding solutions to APEL problems at IFAE, to the SWE ROC. Progress on 2007-01-18: This problem is now fixed. The action can be closed. |
Maite |
29/01/07 |
29/01/07 |
2 |
|
SA3 to set up a wiki page to give guidance on renewing host certificates for the different grid services. OCC to circulate the wiki page and ask for feedback on its suitability.
Progress on 2007-01-15: The wiki pages can be found here: https://uimon.cern.ch/twiki/bin/view/LCG/TheLCGTroubleshootingGuide#How_to_replace_host_certificates
Progress on 2007-01-29: There has been no negative feedback so this item will be closed. |
OCC |
29/01/07 |
29/01/07 |
4 |
|
Publish links to all SRM monitoring. Progress on 2007-01-18: The link to the SAM SRM monitoring is: The monitoring used by the FTS experts is a prototype and can be found here: http://pcitgm02.cern.ch:8081/ Progress on 2007-01-29: Completed. This item can be closed. |
OCC |
29/01/07 |
29/01/07 |
67 |
|
Timescale for move to Torque2?
Progress on 2006-10-30: In progress.
Progress on 2006-11-01: Expected to be in certification within 2-3 weeks.
Progress on 2006-11-20: The counter was restarted last Friday, so it will go to PPS in 2-3 weeks from now. Estimated timeline: in PPS by ~15th Dec.
Progress on 2006-12-18: It was released to PPS on Monday, and removed on Tuesday due to a critical problem found.
Progress on 2007-01-15: This is now being certified at CERN and by one SA3 partner in Greece.
Progress on 2007-01-29: Still in certification.
Progress on 2007-02-12: Waiting for the certification report from the SA3 Greek partner.
Progress on 2007-02-19: The patch containing this should be in the PPS next week.
Progress on 2007-02-26: The patch is included in gLite 3.0 PPS-update 20, which is being deployed. The action can be closed. |
OCC |
29/01/07 |
26/02/07 |
1 |
|
Document VO expiration procedure and associated error message when it happens.
Progress on 2007-01-15: This will be put into the GOC wiki.
Progress on 2007-01-29: In progress.
Progress on 2007-02-12: Maite will update it this week once she gets access to the GOC wiki.
Progress on 2007-02-19: In progress.
Progress on 2007-02-26: GOC wiki updated. This action can be closed. |
OCC |
22/01/07 |
26/02/07 |
6 |
|
Item for COD agenda: Running the RB SAM tests on demand. Should this be a request to the test developers or should the CODs be in the OPS VO? Progress on 2007-02-12: Report about the conclusion once the COD minutes are published Progress on 2007-02-19: The minutes of the COD meeting are not yet published. Progress on 2007-02-26: The minutes of the COD meeting are not yet published. Progress on 2007-03-05: This has been solved by finding a second RB to send SAM tests. The action can be closed. |
Helene |
05/03/07 |
05/03/07 |
7 |
|
Item for COD agenda: Filtering out of SAM site test failures due to the failure of a grid central service.
Progress on 2007-02-12: Report about the conclusion once the COD minutes are published.
Progress on 2007-02-19: The minutes of the COD meeting are not yet published.
Progress on 2007-02-26: The minutes of the COD meeting are not yet published.
Progress on 2007-03-05: A 3-point strategy was defined at the last COD meeting. They are looking for people to implement it. However, the timeline and details need to be refined by the SAM team, as much of the work is to be done in SAM and they are closer to it. The action can be closed. |
Helene |
26/02/07 |
05/03/07 |
9 |
|
Request from the DM developers, who are testing SRM 2.2: gfal and FTS need to know the version of SRM they are talking to. A test will be added to gstat by the end of this week.
Progress on 2007-03-05: This should go in as a WARN this week, to be raised to ERROR a couple of weeks later. This action can be closed. |
Gstat team |
12/03/07 |
05/03/07 |
10 |
|
Escalate ticket 18279 (WMS condorc-launcher files filling /tmp), raised by the DECH ROC, to the EMT.
Progress on 2007-03-05: Tmpwatch can be configured to clean those files up more often, even once a day, if needed. https://savannah.cern.ch/patch/?1062 The configuration is being worked on now, so it will be made available in around a month. The action can be closed. |
OCC |
12/03/07 |
12/03/07 |
11 |
|
Point for next meeting agenda: information about WMS – RB deployment strategy Progress on 2007-03-05: Ian Bird will attend next meeting to give a status update on this. The action can be closed. |
OCC |
12/03/07 |
12/03/07 |
70 |
|
Conclude on "Policy for security updates of third party software". The gLite integration team policy is: the external packages are not guaranteed to be maintained. They are provided for convenience. They are maintained by their providers. The reality is that they will be maintained on a best-effort basis. To be clarified with the security team.
Progress on 2006-11-27: Being discussed with OSCT and SA3.
Progress on 2006-12-04: This item was discussed during the meeting. Waiting for SA3 to create the final list of external packages which need to be maintained.
Progress on 2007-01-15: Waiting for feedback from SA3.
Progress on 2007-01-29: Waiting for feedback from SA3. There is a proposal to focus on removing all unnecessary dependencies within the gLite code which will probably impact this item, so the deadline will be extended.
Progress on 2007-02-12: From Oliver: The progress to date is here: https://twiki.cern.ch/twiki/bin/view/EGEE/SourceTarballs In other words, some sorting of the externals; identification of what is maintained or out-of-date, what can be reclassified, etc. There is a plan being drawn up for a big effort on reducing the external dependencies. If I remember correctly, this plan is due next week. I think it's best to wait for that. The work done so far will be fed into this plan.
Progress on 2007-02-19: No update.
Progress on 2007-03-05: (Update offline after the meeting) A gLite restructuring plan has been worked out by the integration, middleware and operations teams to make a radical examination of the code base with a view to removing unnecessary dependencies and cleaning up sections of the code that cause build and porting difficulties. The following tasks have to be well advanced before the execution of the plan can be started. The EGEE PMB monitors the progress every second week.
- Move to SL4 on worker nodes and user interface
- Move to ETICS build infrastructure
- Stabilization and scalability of WMS and LB
- Stabilization and scalability of the gLite-CE
Even if the plan execution has not started, the developers have already started cleaning the dependencies while porting to the new build system. I would propose to close this action as it is being tracked somewhere else, and come back with information to the operations meeting once it is ongoing. |
OCC |
15/01/07 11/02/07 |
12/03/07 |
8 |
|
From COD report: RB-time_to_match test must be improved. Many alarms linked with this test were useless. To be fixed by UKI, Steve will open a GGUS ticket and assign it to them Progress on 2007-03-05: ticket 19319 opened https://gus.fzk.de/ws/ticket_info.php?ticket=19319&from=ID Progress on 2007-03-12: In progress. Progress on 2007-03-19: This is now fixed and the ticket can be closed. |
UKI ROC |
12/03/07 |
19/03/07 |
15 |
|
OCC to chase up a solution for ticket 19464 (https://gus.fzk.de/pages/ticket_details.php?ticket=19464) Progress on 2007-03-19: This ticket is now being handled by the gLite integration team. |
OCC |
26/03/07 |
19/03/07 |
16 |
|
LHCb to produce a list of services that will be affected due to the new SRM v1 endpoint at INFN-T1. Also, a coordinator for the intervention must be found. Progress on 2007-03-19: The list of services has been provided. The intervention coordinator is Marianne Bargiotti. |
LHCb |
26/03/07 |
19/03/07 |
17 |
|
LHCb wish to use dccp across the WAN to do some pre-staging. LHCb to provide the relevant tier 1 sites with the details of what is required (port numbers, port types, etc.) Progress on 2007-03-19: LHCb has directly contacted the sites involved (RAL, GridKA and IN2P3). RAL is in the process of opening the relevant ports. GridKA and IN2P3 have yet to confirm their position. |
LHCb |
26/03/07 |
19/03/07 |
13 |
- |
Some sites are accidentally publishing a local LFC as a global LFC. How might this be prevented?
Progress on 2007-03-19: Steve spoke to Judit who thinks that a solution can be implemented in the FCR tool. She will check.
Progress on 2007-03-26: No update at the meeting.
Progress on 2007-04-02: Judit has analysed the feasibility of providing this functionality using the FCR and it can be done. It will go into the work plan. |
Judit Novak |
2/04/07 |
2/04/07 |
18 |
- |
A middleware tool for carrying out the bulk removal of files from SEs, appropriately updating the catalogues, etc. is needed. This will be put onto the List of Issues maintained by the ROCs and regularly presented to the TCG.
Progress on 2007-04-02: This is now on the ROC Top Issues list, item 23 (https://twiki.cern.ch/twiki/bin/view/EGEE/SA1_TCG) |
OCC |
02/04/07 |
02/04/07 |
19 |
- |
Request for a mechanism so top level BDIIs can publish themselves.
Progress on 2007-03-26: Bug opened by Steve Traylen, to be raised at the EMT: https://savannah.cern.ch/bugs/?25033
Progress on 2007-04-02: This work has been scheduled to be done immediately after the release of gLite 3.1. |
OCC |
02/04/07 |
02/04/07 |
21 |
- |
Development/testing/certification/PPS status of MySQL LFC Progress on 2007-03-26: PPS: several local instances, plus the global LFC at PIC run MySQL. Certification: Both flavours of LFC are tested and have permanent installations on the testbed. To be closed. |
OCC |
02/04/07 |
02/04/07 |
22 |
- |
Estimation on when SLC4 WNs will be available in production Progress on 2007-04-02: The SL4 natively compiled WN is now in the PPS. It will be tested by the HEP VOs and when they are happy with it, the WN will be passed to production. Action to be closed. |
OCC |
02/04/07 |
02/04/07 |
23 |
- |
Estimation on when the unified version of RFIO client for DPM and castor will be in production Progress on 2007-03-26: update from the developers: no manpower, this is not expected this year. |
OCC |
02/04/07 |
02/04/07 |
15 |
- |
SA3 need to know when the larger sites plan to move to SL4 (and binary compatible) machines. This will allow better planning of the move of the middleware from SL3 to SL4. This information should be sent to Maite.Barroso.Lopez@cern.ch and Nicholas.Thackray@cern.ch
Progress on 2007-03-26: The following ROCs/sites have supplied information: SW Europe ROC/PIC, DECH/FZK, SARA, SE Europe ROC, NIKHEF. Please can the ROCs gather data from all remaining large sites.
Progress on 2007-04-02: In progress.
Progress on 2007-04-16: It can be closed; all major sites have given feedback, which has been put together at the following wiki: https://twiki.cern.ch/twiki/bin/view/EGEE/Sites_Plans_to_go_to_SLC4 |
All EGEE ROCs |
16/04/07 |
26/03/07 |
24 |
ROC DECH |
Ask the R-GMA development team if they can attend the next grid operations meeting to answer questions on instabilities seen in the R-GMA system.
Progress on 2007-04-16: Done, the action can be closed. |
OCC |
16/04/07 |
16/04/07 |
25 |
All |
Ask SA3 to give an update at the next grid operations meeting on the status of the port to SL4 and also the relative priorities of the different middleware services.
Progress on 2007-04-16: Done, this can be closed. |
OCC |
16/04/07 |
16/04/07 |
28 |
OSG |
Check that Laurence Field is the correct person for OSG to contact regarding their problems with the BDII. Progress on 2007-04-16: Yes, it is Laurence, please, contact him. The action can be closed. |
OCC |
16/04/07 |
16/04/07 |
29 |
CERN/ Triumf/ UK |
Escalate the problems being seen with the job wrapper tests to the ROC managers’ meeting. Progress on 2007-04-16: This was done, Piotr presented it at today’s meeting (see minutes). The action can be closed. |
Nick |
16/04/07 |
3/04/07 |
14 |
- |
Site HPC2N have the SE ibelieve-i.hpc2n.umu.se that they want to take out of production. Simone to look at the data on the SE and decide if it needs to be moved or can be thrown away.
Progress on 2007-03-19: The site has provided ATLAS with a list of files which exist on the SE. ATLAS will go through this and decide what to do with the files. Per requests that this be carried out as quickly as possible.
Progress on 2007-03-26: No progress due to ATLAS’s problems with CASTOR and the ATLAS week in Germany this week.
Progress on 2007-04-02: No-one at the meeting knew of any update. Per will follow it up.
Progress on 2007-04-16: Nobody from ATLAS online; chase offline.
Progress on 2007-04-23: As there was no response from ATLAS, the files were removed. Contact the site admin if more info is required: ake.sandgren@hpc2n.umu.se This action can be closed. |
Simone C. / Per Oster |
23/04/07 |
26/03/07 |
34 |
SEE ROC |
gLite 21 update release notes stated that reconfiguration is needed just for lcg-CE, lcg-CE_torque, and glite-CE, but in fact you need to introduce new accounts on all WNs at the same time. This is the list of GGUS tickets we created so far:
Progress on 2007-05-07: 2 tickets are now solved and one unsolved (Savannah bug opened), so the action can be closed. |
OCC - YAIM |
07/05/07 |
07/05/07 |
26 |
ROC SEE |
Check up on the status of the following tickets: https://gus.fzk.de/pages/ticket_details.php?ticket=18689
Progress on 2007-04-16: 18353: not solved but some activity, ongoing. 18698: no progress for a long time, no answer. To be raised again.
Progress on 2007-04-23: These tickets were raised to the EMT on 25/04/07. Some action will follow and will be reflected in the tickets.
Progress on 2007-05-07: These tickets are now being handled. |
OCC |
07/05/07 |
16/04/07 |
31 |
COD (Russia) |
Solve first issue in the list of COD notes for today’s meeting: http://egee-docs.web.cern.ch/egee-docs/operational_tools/Operations_Meetings/2007/Weekly_Operations_Meeting_minutes_2007-04-23.htm Error while calling the "NSClient::multi" native api IOException: Unable to connect to remote (lapp-rb01.in2p3.fr:7772) Progress on 2007-05-07: This is not an issue for the PPS to solve, but for the site to solve. The site should ask for help from their ROC and if the ROC cannot help then a ticket should be raised. Close. |
PPS coordination |
07/05/07 |
07/05/07 |
20 |
- |
Search for or request documentation about how a VOMS proxy is mapped on a grid node (CE, SE, etc.) using LCMAPS. Maria Dimou provided pointers to docs from 2005. Waiting for input from Maarten and David Groep.
Progress on 2007-04-02: In progress.
Progress on 2007-04-16: The available information was collected and sent to Pierre. It is not enough; he needs the rules and they are not available. Suggestion to open a ticket to the developers.
Progress on 2007-04-23: Pierre forgot to open the ticket; he will do it this week so this issue can be assigned and solved.
Progress on 2007-05-07: A GGUS ticket was submitted (GGUS #21349) to ask developers for a complete LCMAPS administrator guide. The ticket has been raised to the EMT.
Progress on 2007-05-14: Comment from the EMT: we will work on it as soon as possible. This action can be closed. |
OCC |
14/05/07 |
21/05/07 |
27 |
OCC |
Report on the outcome of the tests regarding grid site status information (taken from the information system, the WMS, the batch systems, etc.).
Progress on 2007-04-16: Patricia reported on it during the meeting. We’ll leave it open till we have the outcome.
Progress on 2007-04-23: Data from grid site tests were made available in MonAlisa. Alice is now checking that the issues reported are actually consistent; that is the reason for the detailed questions on the queues put to GRIDKA during this meeting.
Progress on 2007-05-07: In progress. Alice to report at the next meeting.
Progress on 2007-05-14: Patricia reported about this; see her report in the Alice section of the minutes. This action can be closed. |
Alice |
14/05/07 |
21/05/07 |
30 |
ROC SEE |
Create a wiki page to collect information about deployment of SL4 gLite services, even with workarounds: http://wiki.egee-see.org/index.php/SL4_WN Progress on 2007-04-23: not done yet. Does the SEE ROC volunteer to do it under the main SA1 wiki or at the GOC wiki? Kostas: no manpower. Volunteers are welcome. Progress on 2007-05-14: there are no volunteers, not many people interested, so ROC SEE agrees to close the action |
OCC |
23/04/07 |
21/05/07 |
32 |
CERN ROC, TRIUMF |
SAM still handles timezones incorrectly. Maintenance on Fri 20th was scheduled for 14:00 - 16:00 UTC but SAM shows maintenance incorrectly at 08:04 UTC and in error at 14:02 UTC, i.e. wrongly during our maintenance.
Judit: The timestamp is taken by SAM directly from the GOCDB as it is registered. GOCDB seems not to convert time zones correctly. The GOCDB people will look into the problem together with Judit, who will send all the details already found.
Progress on 2007-05-07: This now looks like it might be a SAM issue. Still under investigation.
Progress on 2007-05-14: Judit is following up.
Progress on 2007-05-21: Bug opened in Savannah to fix this: https://savannah.cern.ch/bugs/index.php?26500. As this issue will be tracked in Savannah, this action item will be closed. |
SAM, GOCDB |
04/06/07 |
21/05/07 |
35 |
UKI ROC |
Technical issues to do with the email that CIC-Portal Alarms sends:
Progress on 2007-05-21: Both points have been addressed. Close this item. |
CIC team |
04/06/07 |
04/06/07 |