Weekly Operations Meeting Action List

Status as of 11th November 2007

Due Date colour key:

Red: action is overdue

Yellow: action is due at or before next meeting

White: action is due some time after the next meeting

Grey: action was closed at the last meeting.
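
The colour key above can be read as a small decision rule. The following is only a sketch of that rule in Python: the function name `due_date_colour` and its parameters are hypothetical, and it assumes that "overdue" means the due date lies before the status date and that closed items take precedence over the date checks. Items with a due date of "ASAP" have no calendar date and are not covered.

```python
from datetime import date


def due_date_colour(due: date, status_date: date, next_meeting: date,
                    closed_at_last_meeting: bool = False) -> str:
    """Return the colour-key entry for one action item (illustrative only)."""
    # Assumption: a closed item is always grey, whatever its due date.
    if closed_at_last_meeting:
        return "grey"
    # Red: the due date has already passed on the status date.
    if due < status_date:
        return "red"
    # Yellow: due at or before the next meeting.
    if due <= next_meeting:
        return "yellow"
    # White: due some time after the next meeting.
    return "white"


if __name__ == "__main__":
    status = date(2007, 11, 11)   # status date of this list
    meeting = date(2007, 11, 12)  # assumed date of the next weekly meeting
    print(due_date_colour(date(2007, 11, 5), status, meeting))
    print(due_date_colour(date(2007, 11, 12), status, meeting))
    print(due_date_colour(date(2007, 12, 3), status, meeting))
```

The order of the checks matters: the overdue test runs before the next-meeting test, so an overdue item is never reported as yellow.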

Open Action Items

#

Raised by:

Description

Assigned to:

Status

Due date

59

COD

Andy to produce a list of possible node and site states within the GOCDB. This can serve as input to further discussion.

Update 2007-08-16: Andy has sent the information. The conversation is on-going.

Update 2007-08-20: In progress.

Update 2007-08-27: In progress.

Update 2007-09-10: In progress. To be discussed at EGEE 07.

Update 2007-09-10: Was not discussed at EGEE 07 due to lack of time. Will continue to discuss here.

Update 2007-10-22: In progress.

Update 2007-10-29: No one present to comment.

Update 2007-11-05: No update received.

Andy Newton

In progress

ASAP

64

UK/I

Proposal to deal with informing UIs to reconfigure when a WMS/LB changes its node name. This needs to be taken to the next ROC Managers' meeting.

Update 2007-10-22: The UI can now carry out service discovery. Steve Traylen is checking this. Need to check if the version of the WMS can be found through service discovery.

Update 2007-10-29: Pending a GGUS ticket: https://gus.fzk.de/pages/ticket_details.php?ticket=28373

Update 2007-11-05: No update received.

Nick

In progress

ASAP

67

ATLAS

Track queues showing VOViews problems at the weekly operations meeting.

Update 2007-10-22: There are now 90 queues still in question, which is fewer than two weeks ago. Carry on as is while the number continues to go down.

Update 2007-11-05: Details on the tests provided by Simone Campana (available in this agenda). He reports: ‘There are currently 65 queues with problems. It would be nice if some action could be taken.’

Ops coord / ATLAS

In progress

ASAP

68

 

 

 

 

 


Closed Action Items

#

Raised by:

Description

Assigned

To:

Due date

Date closed

71

 

ATLAS to check if they know of any conflicts between SL kernel version 2.6 and either the application software or the middleware.

Progress on 2006-12-11: ATLAS was not present at the meeting; we’ll check offline.

Progress on 2006-12-18:

Alessandro DeSalvo says that, with respect to the ATLAS application, there is no problem. The only issue might arise (though this is not at all certain) from the Oracle client in the production system (which in any case has only 4 instances in the whole Grid) and the Data Management clients in the VOBOXes. So as long as this discussion does not refer to VOBOXes, this is OK. The VOBOXes (only 10 nodes for ATLAS, one at each T1) will need to be considered some time soon. I will get in touch with Miguel for this.

The action can be closed.

Simone

11/12/06

18/12/06

76

 

VOs to update their mailing lists of grid users so that grid operational messages are communicated to all users when necessary.

Progress on 2007-01-08: The affected VOs have been contacted and the mailing lists updated. The action can be closed.

OCC / VOs

15/01/07

15/01/07

75

 

Provide DPM to ATLAS for testing purposes in PPS service.

Progress on 2007-01-08: 2 DPM nodes are now installed at the CERN PPS site. In progress.

Progress on 2007-01-15: This is now done. ATLAS will use lxb2058 which is a DPM at the CERN_PPS site.

Nick

15/01/07

15/01/07

3

 

Maite to forward e-mail regarding solutions to APEL problems at IFAE, to the SWE ROC.

Progress on 2007-01-18: This problem is now fixed. The action can be closed.

Maite

29/01/07

29/01/07

2

 

SA3 to set up a wiki page to give guidance on renewing host certificates for the different grid services. OCC to circulate the wiki page and ask for feedback on its suitability.

Progress on 2007-01-15: The wiki pages can be found here:

https://uimon.cern.ch/twiki/bin/view/LCG/TheLCGTroubleshootingGuide#How_to_replace_host_certificates

Progress on 2007-01-29: There has been no negative feedback so this item will be closed.

OCC

29/01/07

29/01/07

4

 

Publish links to all SRM monitoring.

Progress on 2007-01-18: The link to the SAM SRM monitoring is:

https://lcg-sam.cern.ch:8443/sam/sam.py?sensors=SRM&regions=CERN&regions=France&regions=UK_Ireland&regions=GermanySwitzerland&regions=Italy&regions=CentralEurope&regions=NorthernEurope&regions=SouthEasternEurope&regions=SouthWesternEurope&regions=Russia&regions=AsiaPacific&regions=IU_iuatlas&regions=USCMS&regions=USATLAS&vo=ops&order=RegionName&funct=ShowSensorTests

The monitoring used by the FTS experts is a prototype and can be found here: http://pcitgm02.cern.ch:8081/

Progress on 2007-01-29: Completed. This item can be closed.

OCC

29/01/07

29/01/07

67

 

Timescale for move to Torque2?

Progress on 2006-10-30: In progress.

Progress on 2006-11-01: Expected to be in certification within 2-3 weeks.

Progress on 2006-11-20: The counter was restarted last Friday, so it will go to PPS in 2-3 weeks from now.

Estimated timeline: in PPS by ~15th Dec

Progress on 2006-12-18: It was released to PPS on Monday and removed on Tuesday because a critical problem was found.

Progress on 2007-01-15: This is now being certified at CERN and by one SA3 partner in Greece.

Progress on 2007-01-29: Still in certification.

Progress on 2007-02-12: Waiting for the certification report from the SA3 Greek partner.

Progress on 2007-02-19: The patch containing this should be in the PPS next week.

Progress on 2007-02-26: The patch is included in gLite 3.0 PPS-update 20, which is being deployed. The action can be closed.

OCC

29/01/07

26/02/07

1

 

Document VO expiration procedure and associated error message when it happens.

Progress on 2007-01-15: This will be put into the GOC wiki.

Progress on 2007-01-29: In progress.

Progress on 2007-02-12: Maite will update it this week once she gets access to the GOC wiki.

Progress on 2007-02-19: In progress.

Progress on 2007-02-26: Gocwiki updated. This action can be closed.

OCC

22/01/07

26/02/07

6

 

Item for COD agenda: Running the RB SAM tests on demand. Should this be a request to the test developers or should the CODs be in the OPS VO?

Progress on 2007-02-12: Report about the conclusion once the COD minutes are published

Progress on 2007-02-19: The minutes of the COD meeting are not yet published.

Progress on 2007-02-26: The minutes of the COD meeting are not yet published.

Progress on 2007-03-05: This has been solved by finding a second RB to send SAM tests. The action can be closed.

Helene

12/02/07

05/03/07

05/03/07

7

 

Item for COD agenda: Filtering out of SAM site test failures due to the failure of a grid central service.

Progress on 2007-02-12: Report about the conclusion once the COD minutes are published.

Progress on 2007-02-19: The minutes of the COD meeting are not yet published.

Progress on 2007-02-26: The minutes of the COD meeting are not yet published.

Progress on 2007-03-05: A 3-point strategy was defined at the last COD meeting. They are looking for people to implement it:

http://goc.grid.sinica.edu.tw/gocwiki/Tools_Improvements_for_COD#head-619c8018ab3d7bb773674324b16d97e51bc82b98

However, the timeline and details need to be refined by the SAM team, as much of the work is to be done in SAM and they are closer to it.

The action can be closed.

Helene

12/02/07

26/02/07

05/03/07

9

 

Request from the DM developers, who are testing SRM 2.2: gfal and FTS need to know the version of SRM they are talking to. A test will be added to gstat by the end of this week.

Progress on 2007-03-05: This should go in as a WARN this week, to be raised to ERROR in a couple of weeks.

This action can be closed.

Gstat team

12/03/07

05/03/07

10

 

Escalate ticket 18279 (WMS condorc-launcher files filling /tmp), raised by the DECH ROC, to the EMT.

Progress on 2007-03-05: Tmpwatch can be configured to clean those files up more often, even once a day, if needed.
The location and verbosity of those files were made configurable as of Condor version 6.8.4.
To the best of my understanding this Condor version is being tested for further distribution, but this issue is closed as far as development goes.
The Condor version is 6.8.4:

https://savannah.cern.ch/patch/?1062

The configuration is being worked on now, so it will be made available in around a month.

The action can be closed.

OCC

12/03/07

12/03/07

11

 

Point for next meeting agenda: information about WMS – RB deployment strategy

Progress on 2007-03-05: Ian Bird will attend the next meeting to give a status update on this.

The action can be closed.

OCC

12/03/07

12/03/07

70

 

Conclude on "Policy for security updates of third party software".

The gLite integration team policy is: the external packages are not guaranteed to be maintained. They are provided for convenience. They are maintained by their providers.

The reality is that they will be maintained on best effort.

To be clarified with the security team.

Progress on 2006-11-27: Being discussed with OSCT and SA3.

Progress on 2006-12-04: This item was discussed during the meeting. Waiting for SA3 to create the final list of external packages which need to be maintained.

Progress on 2007-01-15: Waiting for feedback from SA3.

Progress on 2007-01-29: Waiting for feedback from SA3. There is a proposal to focus on removing all unnecessary dependencies within the gLite code which will probably impact this item, so the deadline will be extended.

Progress on 2007-02-12: From Oliver: The progress to date is here:

https://twiki.cern.ch/twiki/bin/view/EGEE/SourceTarballs

In other words, some sorting of the externals; identification of what is maintained or out-of-date, what can be reclassified etc.

There is a plan being drawn up for a big effort on reducing the external dependencies. If I remember correctly, this plan is due next week. I think it's best to wait for that. The work done so far will be fed into this plan.

Progress on 2007-02-19: No update.

Progress on 2007-03-05: (Update offline after the meeting)

A gLite restructuring plan has been worked out by the integration, middleware and operations teams to make a radical examination of the code base with a view to removing unnecessary dependencies and cleaning up sections of the code that cause build and porting difficulties.

The following tasks have to be well advanced before the execution of the plan can be started. The EGEE PMB monitors the progress every second week.

- Move to SL4 on worker nodes and user interface

- Move to ETICS build infrastructure

- Stabilization and scalability of WMS and LB

- Stabilization and scalability of the gLite-CE

Even if the plan execution has not started, the developers have already started cleaning the dependencies while porting to the new build system.

I would propose to close this action as it is being tracked somewhere else, and come back with information to the operations meeting once it is ongoing.

OCC

15/01/07

11/02/07

12/03/07

8

 

From COD report: RB-time_to_match test must be improved. Many alarms linked with this test were useless. To be fixed by UKI; Steve will open a GGUS ticket and assign it to them.

Progress on 2007-03-05: Ticket 19319 opened:

https://gus.fzk.de/ws/ticket_info.php?ticket=19319&from=ID

Progress on 2007-03-12: In progress.

Progress on 2007-03-19: This is now fixed and the ticket can be closed.

UKI ROC

12/03/07

19/03/07

15

 

OCC to chase up a solution for ticket 19464 (https://gus.fzk.de/pages/ticket_details.php?ticket=19464)

Progress on 2007-03-19: This ticket is now being handled by the gLite integration team.

OCC

26/03/07

19/03/07

16

 

LHCb to produce a list of services that will be affected due to the new SRM v1 endpoint at INFN-T1. Also, a coordinator for the intervention must be found.

Progress on 2007-03-19: The list of services has been provided. The intervention coordinator is Marianne Bargiotti.

LHCb

26/03/07

19/03/07

17

 

LHCb wish to use dccp across the WAN to do some pre-staging. LHCb to provide the relevant tier 1 sites with the details of what is required (port numbers, port types, etc.)

Progress on 2007-03-19: LHCb has directly contacted the sites involved (RAL, GridKA and IN2P3). RAL is in the process of opening the relevant ports. GridKA and IN2P3 have yet to confirm their position.

LHCb

26/03/07

19/03/07

15

-

SA3 need to know when the larger sites plan to move to SL4 (and binary-compatible) machines. This will allow better planning of the move of the middleware from SL3 to SL4. This information should be sent to Maite.Barroso.Lopez@cern.ch and Nicholas.Thackray@cern.ch

Progress on 2007-03-26: The following ROCs/sites have supplied information:

SW Europe ROC/PIC, DECH/FZK, SARA, SE Europe ROC, NIKHEF. Please can the ROCs gather data from all remaining large sites.

Progress on 2007-04-02: In progress.

Progress on 2007-04-16: This can be closed; all major sites have given feedback, which is collected on the following wiki:

https://twiki.cern.ch/twiki/bin/view/EGEE/Sites_Plans_to_go_to_SLC4

All EGEE ROCs

16/04/07

26/03/07

14

-

Site HPC2N has an SE, ibelieve-i.hpc2n.umu.se, that it wants to take out of production. Simone to look at the data on the SE and decide if it needs to be moved or can be thrown away.

Progress on 2007-03-19: The site has provided ATLAS with a list of files which exist on the SE. ATLAS will go through this and decide what to do with the files. Per requests that this be carried out as quickly as possible.

Progress on 2007-03-26: No progress due to ATLAS’s problems with CASTOR and the ATLAS week in Germany this week.

Progress on 2007-04-02: No-one at the meeting knew of any update. Per will follow it up.

Progress on 2007-04-16: Nobody from ATLAS online; chase offline.

Progress on 2007-04-23: As there was no response from ATLAS, the files were removed. Contact the site admin if more info is required: ake.sandgren@hpc2n.umu.se

This action can be closed.

Simone C. / Per Oster

23/04/07

26/03/07

13

-

Some sites are accidentally publishing a local LFC as a global LFC. How might this be prevented?

Progress on 2007-03-19: Steve spoke to Judit who thinks that a solution can be implemented in the FCR tool. She will check.
The requirement for the FCR can be found here: https://savannah.cern.ch/bugs/index.php?24812

Progress on 2007-03-26: No update at the meeting.

Progress on 2007-04-02: Judit has analysed the feasibility of providing this functionality using the FCR and it can be done. It will go into the work plan.

Judit Novak

2/04/07

2/04/07

18

-

A middleware tool for carrying out the bulk removal of files from SEs, appropriately updating the catalogues, etc. is needed. This will be put onto the List of Issues maintained by the ROCs and regularly presented to the TCG.

Progress on 2007-04-02: This is now on the ROC Top Issues list, item 23 (https://twiki.cern.ch/twiki/bin/view/EGEE/SA1_TCG)

OCC

02/04/07

02/04/07

19

-

Request for a mechanism so that top-level BDIIs can publish themselves.

Progress on 2007-03-26: Bug opened by Steve Traylen, to be raised at the EMT:

https://savannah.cern.ch/bugs/?25033

Progress on 2007-04-02: This work has been scheduled to be done immediately after the release of gLite 3.1.

OCC

02/04/07

02/04/07

21

-

Development/testing/certification/PPS status of MySQL LFC

Progress on 2007-03-26: PPS: several local instances, plus the global LFC at PIC run MySQL.

Certification: Both flavours of LFC are tested and have permanent installations on the testbed. To be closed.

OCC

02/04/07

02/04/07

22

-

Estimate of when SLC4 WNs will be available in production

Progress on 2007-04-02: The SL4 natively compiled WN is now in the PPS. It will be tested by the HEP VOs and when they are happy with it, the WN will be passed to production. Action to be closed.

OCC

02/04/07

02/04/07

23

-

Estimate of when the unified version of the RFIO client for DPM and CASTOR will be in production

Progress on 2007-03-26: Update from the developers: no manpower; this is not expected this year.

OCC

02/04/07

02/04/07

29

CERN/ Triumf/ UK

Escalate the problems being seen with the job wrapper tests to the ROC managers’ meeting.

Progress on 2007-04-16: This was done; Piotr presented it at today’s meeting (see minutes). The action can be closed.

Nick

16/04/07

3/04/07

24

ROC DECH

Ask the R-GMA development team if they can attend the next grid operations meeting to answer questions on instabilities seen in the R-GMA system.

Progress on 2007-04-16: Done; the action can be closed.

OCC

16/04/07

16/04/07

25

All

Ask SA3 to give an update at the next grid operations meeting on the status of the port to SL4 and also the relative priorities of the different middleware services.

Progress on 2007-04-16: Done; this can be closed.

OCC

16/04/07

16/04/07

28

OSG

Check that Laurence Field is the correct person for OSG to contact regarding their problems with the BDII.

Progress on 2007-04-16: Yes, it is Laurence; please contact him. The action can be closed.

OCC

16/04/07

16/04/07

26

ROC SEE

Check up on the status of the following tickets:

https://gus.fzk.de/pages/ticket_details.php?ticket=18689
https://gus.fzk.de/pages/ticket_details.php?ticket=18353

Progress on 2007-04-16:

18353: not solved but some activity, ongoing.

18698: no progress for a long time, no answer. To be raised again.

Progress on 2007-04-23: These tickets were raised to the EMT on 25/04/07. Some action will follow and will be reflected in the tickets.

Progress on 2007-05-07: These tickets are now being handled.

OCC

07/05/07

16/04/07

34

SEE ROC

The gLite 21 update release notes stated that reconfiguration is needed just for lcg-CE, lcg-CE_torque, and glite-CE, but in fact you need to introduce new accounts on all WNs at the same time. This is the list of GGUS tickets we have created so far:
https://gus.fzk.de/pages/ticket_details.php?ticket=20941
https://gus.fzk.de/pages/ticket_details.php?ticket=20942
https://gus.fzk.de/pages/ticket_details.php?ticket=21044
To conclude, I would say that the release notes yet again failed to mention some important things, and that this can badly affect VOs that make heavy use of prd or sgm accounts.

Progress on 2007-05-07: 2 tickets are now solved and one unsolved (Savannah bug opened), so the action can be closed.

OCC - YAIM

07/05/07

07/05/07

31

COD (Russia)

Solve first issue in the list of COD notes for today’s meeting: http://egee-docs.web.cern.ch/egee-docs/operational_tools/Operations_Meetings/2007/Weekly_Operations_Meeting_minutes_2007-04-23.htm

Error while calling the "NSClient::multi" native api IOException: Unable to connect to remote (lapp-rb01.in2p3.fr:7772)

Progress on 2007-05-07: This is not an issue for the PPS to solve, but for the site to solve. The site should ask for help from their ROC and if the ROC cannot help then a ticket should be raised. Close.

PPS coordination

07/05/07

07/05/07

20

-

Search for or request documentation on how a VOMS proxy is mapped on a grid node (CE, SE, etc.) using LCMAPS.

Maria Dimou provided pointers to docs from 2005. Waiting for input from Maarten and David Groep.

Progress on 2007-04-02: In progress.

Progress on 2007-04-16: The available information was collected and sent to Pierre. It is not enough, he needs the rules and they are not available. Suggestion to open a ticket to the developers.

Progress on 2007-04-23: Pierre forgot to open the ticket; he will do it this week so this issue can be assigned and solved.

Progress on 2007-05-07: A GGUS ticket was submitted (GGUS #21349) to ask developers for a complete LCMAPS administrator guide. The ticket has been raised to the EMT.

Progress on 2007-05-14: Comment from the EMT: we will work on it as soon as possible. This action can be closed.

OCC

16/04/07

14/05/07

21/05/07

27

OCC

Report on the outcome of the tests regarding grid site status information (taken from the information system, the WMS, the batch systems, etc.).

Progress on 2007-04-16: Patricia reported on it during the meeting. We’ll leave it open until we have the outcome.

Progress on 2007-04-23: Data from the grid site tests were made available in MonAlisa. Alice is now checking that the issues reported are actually consistent; that is the reason for the detailed questions about the queues put to GRIDKA during this meeting.

Progress on 2007-05-07: In progress. Alice to report at the next meeting.

Progress on 2007-05-14: Patricia reported about this; see her report in the Alice section of the minutes. This action can be closed.

Alice

14/05/07

21/05/07

30

ROC SEE

Create a wiki page to collect information about deployment of SL4 gLite services, even with workarounds:

http://wiki.egee-see.org/index.php/SL4_WN

Progress on 2007-04-23: Not done yet. Does the SEE ROC volunteer to do it under the main SA1 wiki or at the GOC wiki? Kostas: no manpower. Volunteers are welcome.

Progress on 2007-05-14: There are no volunteers and not many people are interested, so ROC SEE agrees to close the action.

OCC

23/04/07

21/05/07

32

CERN ROC, TRIUMF

SAM still handles timezones incorrectly. Maintenance on Fri 20th was scheduled for 14:00 - 16:00 UTC, but SAM shows maintenance incorrectly at 08:04 UTC and in error at 14:02 UTC, i.e. wrongly during our maintenance.

Judit: the timestamp is taken by SAM directly from the GOCDB as it is registered. GOCDB does not seem to convert time zones correctly. The GOCDB people will look into the problem together with Judit, who will send all the details already found.

Progress on 2007-05-07: This now looks like it might be a SAM issue. Still under investigation.

Progress on 2007-05-14: Judit is following up.

Progress on 2007-05-21: Bug opened in Savannah to fix this: https://savannah.cern.ch/bugs/index.php?26500. As this issue will be tracked in Savannah, this action item will be closed.

SAM, GOCDB

04/06/07

21/05/07

33

CERN ROC, FNAL

We set up a 2nd lcg gateway for redundancy. But if either goes down, SAM flags us as being down, thereby defeating the purpose of the 2nd gateway. Of course we are still operational, only SAM is marking us incorrectly. How can this be improved? I was told CERN runs multiple gateways; how do they handle this? 2. We need to split the cmswnNNN accounts on the 2 gateways since they operate independently.

Judit: SAM considers the gateways separately as two computing elements and they are both monitored.

The issue has to be followed off-line between SAM support and FNAL.

Progress on 2007-05-14: Judit is following up.

Progress on 2007-05-21: SAM support could not find the problem reported and has explained to the originators that they are only seeing the expected behaviour of SAM. Any additional comments from the originators, or can we close this action?

Update from FNAL: FNAL is checking again if they appear as completely down when only one CE is down. Test being done now, more news next week.

Progress on 2007-06-04: According to Joe Kaiser and Judit Novak, the issue is solved.

SAM

14/05/07

4/06/07

21/05/07

37

ROC France

YAIM doesn’t support multiple clusters/sub-clusters per CE. A bug needs to be submitted for this.

Progress on 2007-06-04: https://savannah.cern.ch/bugs/?26757

To be closed.

Steve Traylen

04/06/07

21/05/07

35

UKI ROC

Technical issues to do with the email that CIC-Portal Alarms send:
a) The From field should be CIC-Portal@in2p3.fr and not just CIC-Portal. Otherwise intervening mail relays add their own spurious @host info and so the mail can be misidentified by mail browsers.
b) All emails from CIC-Portal, and in2p3.fr generally, are given a Spam-Assassin rating of DNS_FROM_RFC_ABUSE 0.37, plus whatever other spam score the contents of the message might incur. This would be avoided if in2p3.fr got itself de-listed from www.rfc-ignorant.com - that shouldn't be hard!

Progress on 2007-05-21: Both points have been addressed. Close this item.

CIC team

04/06/07

04/06/07

38

ROC UKI

UKI-SCOTGRID-GLASGOW had to clear jobs which were stalled due to lcg-cr commands hanging

(http://scotgrid.blogspot.com/2007/05/users-and-stalled-jobs.html). No response from biomed user who was responsible for most of these (https://gus.fzk.de/pages/ticket_details.php?ticket=22717). User will be banned from our site if no response is forthcoming.

We believe this is a reasonable policy for our site, but are there official guidelines on this?

Steve has contacted Romain to see if there is anything existing on this policy area.

Romain forwarded a proposal to the OSCT; it is under discussion.

Progress on 2007-06-25: Mail sent to Romain to check progress.

Progress on 2007-07-02: The ticket has been solved and closed. It seemed to be a problem on the remote SE, which was uncovering a low-level bug in the gridftp v1 code. See also https://gus.fzk.de/ws/ticket_info.php?ticket=22634

Steve/Romain

02/07/07

18/06/07

39

DECH ROC

ATLAS to supply a list of sites at which the FQAN VOviews tags should be removed.

Progress on 2007-06-25: Done, Simone sent this last week. It is now linked to the agenda page.

Simone Campana / ATLAS

25/06/07

25/06/07

41

OCC

FTS development team and integration release team to remove FTS 2.0 from production repository

Progress on 2007-07-02: Gavin distributed a proposal on 3/07/07 to the T1s about how to proceed:

"To avoid prematurely upgrading to FTS 2.0 from the production repository the gLite integration team suggest disabling or removing the gLite apt or yum config files on the FTS server nodes, since going back to a previous gLite release is not trivial with the current system."

If we do the staged-release thing again, we will keep the 'new' version in the PPS repository until it is time for wider deployment.

FYI. The patch for the new version of FTS 2.0 has entered gLite integration (with a variety of bug fixes, as found on the FTS pilot service and on CERN-PROD)

Progress on 2007-07-09: Check with Gavin.

Progress on 2007-07-16: No objection from the T1s. The action can be closed.

FTS team

16/07/07

02/07/07

45

ROC France

Accounting portal isn’t showing data for the last 3 months.

Progress on 2007-07-09: It works now, it can be closed.

ROC SWE

9/07/07

9/07/07

44

ROC France

Schedule the updating of the information system in the production service (top-level BDII, site BDII, etc.)

Update 3/07/07: The broadcast was sent today. This item can be closed.

OCC

3/07/07

9/07/07

43

ROC France

Submit a change request to the GOCDB to have a flag which shows whether a service is considered as being in production.

Progress on 2007-07-09: Request submitted and accepted by the GOCDB team. This action can be closed.

ROC France

9/07/07

9/07/07

40

OCC

Publish dates and features of the next gridview release

Progress on 2007-07-02: After the discussions between the gridview team and WLCG a decision was made that the sites have to confirm if the new availability calculation is acceptable for them.

This means that the algorithm will have to be presented to them (not yet agreed exactly when and how), and a release can come only following that.

So for the time being Gridview can present neither the new features coming with the next release (as they might need to be changed) nor the approximate time of the release (not yet known).

SAM/Gridview

02/07/07

09/07/07

42

PPS

Decide whether the severity of the broker-info test needs to be temporarily downgraded from critical.

Update 3/07/07: The problem was found to be that the LINUX “which” command was not supported at the site. It is expected that this command should be available at all sites and so the SAM test does not need to be changed. This item can be closed.

OCC

02/07/07

9/07/07

46

ROC Russia

Russian COD team to send to ROC managers the list of ~10 CEs unregistered in GOCDB and monitored by SAM (because they appear in the production top-level BDIIs)

Progress on 2007-07-16: This can no longer be done, as the data is transitory and can’t be recreated. COD will raise tickets when they come across similar sites. Close the item.

ROC Russia

16/07/07

16/07/07

48

ROC France

Make public the official URLs to be used for a TOP BDII configuration

Progress on 2007-07-16: The top URL for BDII configurations has been put on the CERN-ROC page. Close the item.

ROC CERN

16/07/07

16/07/07

47

ROC France

In the weekly ROC report, we can fill in the T1 availability text field, but it is impossible to consult it afterwards. Please update the consultation of the ROC report to make this field available.

Progress on 2007-07-23: The field is now available to be consulted. A few improvements requested by Maite and Alberto Aimar to Osman are being implemented. The action can be closed.

CIC team

16/07/07

30/07/07

49

OCC

Tier 1 sites to include status updates of their migration to the SL4 WN in their weekly site reports.

Progress on 2007-07-23: Some sites/ROCs already reported for today’s meeting. See agenda/minutes. We’ll keep it as a recurring point in the agendas of the coming weeks, until the end of August, so the action can be closed.

All tier 1 sites

31/08/07

30/07/07

50

UKI ROC

All ROCs to check with sites whether one week's notice is enough to change firewall rules as a consequence of the change of HW and IP address for the R-GMA registry at RAL. The change will take place in September.

Progress on 2007-07-30: There are no objections, so RAL can proceed with the plan. They will send a broadcast and also give notification at the operations meeting one week in advance of their intervention, starting from September.

All ROCs

06/08/07

30/07/07

34

SWE ROC

What is the status of the VO configuration tool in the CIC portal? Is it available?

Progress on 2007-05-14: Work in progress; we’ll report in the coming weeks.

Progress on 2007-06-04: Update requested by mail to the CIC portal team:

The Oracle migration is complete. So are the small modifications to the way VO registrations are handled in the YAIM VO Configurator tool (the use of the CIC VO table and the difference in the way "official" and "non-official" registrations are handled). What is needed is just some more tests before I commit the code to the IN2P3 CVS repository.

Progress on 2007-06-25: 1st version being tested by the CIC team. It will be put in production with an intermediate step: web service at CERN and DB at IN2P3; this does not disturb normal operation. Full integration later during the summer.

Progress on 2007-07-02: YAIM VO Configurator is available at the following URL:
https://yaim.fmi.uni-sofia.bg/yaim/yaim.py
This URL will stay stable until the configurator is finally hosted on the CIC's web cluster.
As stated on the page, it supports the new vo.d format and stores the configuration data in the CIC's database.
Dimitar, Oliver & myself have agreed to go on with the integration process in August.

Progress on 2007-07-09: Summarize status after demo with Dimitar. The next YAIM version will still have the VO examples, as the YAIM tool will not be ready.

Progress on 2007-07-16: Work to be done at the CIC portal waiting for Gilles to be back from vacation; will resume at the beginning of August. Wait till then.

Progress on 2007-08-06: In progress. Completion expected by the end of August. See the minutes of this meeting. Close.

Hélène Cordier

13/08/07

06/08/07

52

Several sites

Request CMS and LHCb to state at the next meeting if they have any requirements for non-tier-1 sites moving to SL4/gLite 3.1.

Update 2007-08-08:

CMS gave the following statement:

All CMS sites should move to SL4 as soon as possible. CMSSW_1_5 and later releases will only be built for SL4.

(The source of the statement is Stefano Belforte)

LHCb gave the following statement:

LHCb are not pushing sites to migrate to SL4. Until recently, LHCb faced some issues running on SL4 WNs because of the previously reported lcg-cp problem which took time to understand. LHCb are now splitting the CEs into SL3 and SL4 such that when issues arise, it is clear as to which version of the OS they are related to. Until the workaround for the lcg-cp problem is fixed in DIRAC (most likely next week) LHCb are preventing their jobs going to SLC4 sites.

After that, just like ATLAS and ALICE, LHCb shall gladly use SL4 resources. They will even try to use 64-bit applications whenever possible in order to gain an additional 30% in performance.

(The source of the statement is Philippe Charpentier)

Update 2007-08-13: Responses have now been received from all experiments and so this item can be closed.

CMS / LHCb

20/08/07

13/08/07

53

SouthEast Europe ROC

Announce to all VOs the decision of the operations meeting that by default sites will configure the VOs they support with pool accounts for SGM and PRD. If a VO does not want this then they should make this clear on their VO ID card.

Update 2007-08-08: Announcement sent to the VO managers and CC to Fred Schaer (Coordinator of the EGEE VO managers).

Update 2007-08-13: This item can be closed.

OCC

20/08/07

13/08/07

56

Central Europe ROC

When will gLite be ported to ia64 (Itanium)? What is the proposed support plan for ia64?

Update 2007-08-08: This will be discussed at the next operations meeting (13th August).

Update 2007-08-13: Discussed during the meeting. Close.

SA3

20/08/07

13/08/07

57

PIC

CMS requested to transfer files from PIC to FNAL using SRM copy mode instead of the default URL copy. They are asking for a specific configuration with dedicated channels. We would like some clarification on this subject.

Update 2007-08-08: (From Gavin) The FTS setup for CMS is still under discussion. Gavin will follow this up. Close?

Update 2007-08-13: Close.

Gavin/Steve

20/08/07

13/08/07

61

COD

SAM to suppress alarms when a site is in downtime.

Update 2007-08-16: SAM should already suppress alarms for sites/nodes that are in scheduled downtime. If this is seen again, please raise a GGUS ticket. Close.

SAM

ASAP

20/08/07

62

BNL

The hourly report plots for the VO "OPS" and the Tier 1 site at BNL do not look right. It is strange that all individual services are green while the overall service shows a different result. How is the overall service status generated? Can this problem be fixed?

Update 2007-08-16: There wasn’t a problem. The site BDII test was failing, hence not all services were green. Close.

SAM

ASAP

20/08/07

51

SA3

Would there be any problems with using the externally maintained “DAG” repository for external dependencies of the middleware?

Update 2007-08-08: All ROCs asked to contact their sites for feedback. No response by 20 August will be taken as agreement with the proposal to use the DAG repository.

Update 2007-08-16: Replies to the proposition can be found here:

http://egee-docs.web.cern.ch/egee-docs/operational_tools/Operations_Meetings/2007/Responses_to_proposed_use_of_DAGS_repository_(up_to_Thurs_16_Aug).txt

Update 2007-08-20: Discussed during the meeting. See the minutes. Close.

EGEE ROCs / sites

20/08/07

20/08/07

54

WLCG

All tier-1 sites to report at each meeting until the end of August on their status and plans regarding moving to SL4 / gLite 3.1 WNs.

Action expired. Close.

Tier-1 sites

27/08/07

27/08/07

55

Several ROCs

Find out what are the plans to test SRM 2.2 in the production service.

Update 2007-08-08: This topic will be discussed at the LCG Management Board this week so there should be some feedback for the grid operations meeting on 13th August.

Update 2007-08-13: Last week’s MB was cancelled. This item will be discussed tomorrow.

Update 2007-08-20: Discussed during the meeting. See the minutes. Close.

OCC

20/08/07

20/08/07

58

PIC

PIC has deployed SL4 WNs. To check whether these WNs are OK for LHCb, they have been using the LHCb-specific SAM tests and everything seems OK; however, the LHCb software does not seem to be properly installed on our WNs. LHCb to comment on this.

Update 2007-08-13: LHCb not at the meeting so no update.

Update 2007-08-20: e-mail sent to PIC and LHCb. This will be followed up off-line. Close.

LHCb

As soon as possible

20/08/07

60

COD

PPS to change how the SAM monitoring is carried out so that the certificate used for submitting the jobs is associated with only one VO.

Update 2007-08-20: Done (see PPS section of the minutes). Close.

PPS

20/08/07

20/08/07

63

OCC

SAM team to present first draft of process for user testing and announcement of updates.

Update 2007-08-27: Done. Close.

SAM

27/08/07

27/08/07

12

-

Clarify the implications for a site of an SE downtime, since tests on the CE also fail because they need a default SE.

Should the site be put in complete downtime? (Note from the minute taker: the default SE could be changed to one at some other site during the SE's downtime.)

How does this affect the VOs?

Progress on 2007-03-26: In progress. Maite is coordinating.

Progress on 2007-04-16: No news, to follow up

Progress on 2007-04-23:

Proposal to follow the SAM site availability calculation rules:

http://goc.grid.sinica.edu.tw/gocwiki/SAM_Metrics_calculation#head-a5fda62884f5a769de0fa7c3532622a9f94fb40c

 

Which would mean:

- for sites with only one SE, if the SE is down, the whole site will be considered as down, so schedule site downtime

- for sites with more than 1 SE, if one or more (but not all) will be down, schedule downtime only for the affected SEs

- for sites with more than 1 SE, if all will be down, schedule site downtime

 

Summarizing:

If a site has no SEs available, it should be declared as down
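
As an illustration only, the proposed rules above could be expressed as the following minimal sketch. The function name `downtime_scope` and its return labels are hypothetical and are not part of SAM or the GOCDB; the sketch assumes a site has at least one SE.

```python
def downtime_scope(total_ses: int, ses_down: int) -> str:
    """Return the downtime to schedule under the proposed rules.

    Hypothetical helper: the labels "none", "se-only" and "site" are
    illustrative and do not correspond to any real SAM or GOCDB value.
    """
    if total_ses < 1:
        raise ValueError("a site is assumed to have at least one SE")
    if not 0 <= ses_down <= total_ses:
        raise ValueError("ses_down must be between 0 and total_ses")
    if ses_down == 0:
        return "none"      # no SE affected, nothing to schedule
    if ses_down == total_ses:
        return "site"      # no SE left available: whole site is down
    return "se-only"       # some SEs remain: downtime for those SEs only
```

For example, a site with a single SE that goes down gets a site downtime, while a site with three SEs of which one goes down gets a downtime for that SE only.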

Progress on 2007-05-07: There is a long running e-mail discussion on this.

Progress on 2007-05-07: no update

Progress on 2007-05-21: In progress.

Progress on 2007-06-25: this issue has been raised again at today’s meeting. I’ll put all available information in a wiki and try to come up with a proposal.

Progress on 2007-07-09: in progress

Progress on 2007-07-16: Progress information available at

https://savannah.cern.ch/task/?5222

Progress on 2007-08-06: Still in progress but an outcome is expected soon. See the above Savannah task for details.

Progress on 2007-08-13: Still in progress.

Progress on 2007-08-20: Still in progress.

Progress on 2007-08-27: Still in progress.

Progress on 2007-09-10: Still in progress.

Progress on 2007-10-08: Still in progress. This is now tracked in the SA1 Technical Issues project in Savannah and so this action item will be closed. The URL to the Savannah task is:

https://savannah.cern.ch/task/?5222

OCC

As soon as possible

15/10/07

36

SWE ROC

Items for COD meeting agenda in Stockholm: How to deal with failing PPS sites.

Progress on 2007-04-06: It is now included in COD’s agenda. Wait for report and conclusion.

Progress on 2007-06-18: This was discussed at the COD meeting and operations workshop in Stockholm last week. The discussion is still ongoing.

Progress on 2007-06-25: mail sent to PPS to inquire about the conclusion

Progress on 2007-07-02: no progress, Helene and Nick to discuss

Progress on 2007-07-09: in progress

Progress on 2007-07-30: No progress due to people on vacation. Will resume when Hélène returns.

Progress on 2007-08-20: This is being discussed by the PPS site administrators. When an agreement has been reached there, a discussion will be had with the CODs.

Progress on 2007-08-20: In progress.

Progress on 2007-09-17: In progress. To be discussed at EGEE 07

Progress on 2007-09-17: This was discussed and agreed at EGEE 07. In brief, the CODs will no longer raise tickets for the PPS but instead a weekly report will be generated which can be used to monitor how well sites are monitoring and fixing themselves. All PPS sites will sign up to the RSS feed from the CIC Portal.

OCC

04/06/07

As soon as possible

15/10/07