Weekly Operations Meeting Action List

Status as of 21st May 2007

Due Date colour key:

Red: action is overdue

Yellow: action is due at or before next meeting

White: action is due some time after the next meeting

Grey: action was closed at the last meeting
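The colour key can be read as a simple rule on each action's due date. A minimal sketch in Python, assuming hypothetical names (`due_date_colour`, `today`, `next_meeting`, `closed_at_last_meeting`) that do not appear in the action list itself:

```python
from datetime import date


def due_date_colour(due, today, next_meeting, closed_at_last_meeting=False):
    """Return the colour for an action's due date, following the key above.

    All names here are illustrative assumptions, not part of any existing tool.
    """
    if closed_at_last_meeting:
        return "grey"      # action closed at the last meeting
    if due < today:
        return "red"       # action is overdue
    if due <= next_meeting:
        return "yellow"    # due at or before the next meeting
    return "white"         # due some time after the next meeting
```

With the status date of 21 May 2007 and a next meeting assumed to fall on 28 May 2007, a due date of 19/03/07 would come out red and one of 04/06/07 white.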

Open Action Items

#

Raised by:

Description

Assigned to:

Status

Due date

12

-

Clarify the implications for a site of an SE downtime: tests on the CE will of course fail as well, since they need a default SE.

Should the site be put in complete downtime? (Note from the minute taker: the default SE could be changed to one at another site during the SE's downtime.)

How does this affect the VOs?

Progress on 2007-03-26: In progress. Maite is coordinating.

Progress on 2007-04-16: No news, to follow up

Progress on 2007-04-23:

Proposal to follow the SAM site availability calculation rules:

http://goc.grid.sinica.edu.tw/gocwiki/SAM_Metrics_calculation#head-a5fda62884f5a769de0fa7c3532622a9f94fb40c

 

Which would mean:

- for sites with only one SE, if the SE is down, the whole site will be considered as down, so schedule site downtime

- for sites with more than 1 SE, if one or more (but not all) will be down, schedule downtime only for the affected SEs

- for sites with more than 1 SE, if all will be down, schedule site downtime

 

Summarizing:

If a site has no SEs available, it should be declared as down
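The proposed rules can be sketched as a small decision function. This is only an illustration in Python under stated assumptions: the function name `downtime_to_schedule`, its input format (a mapping from SE hostname to whether that SE will be down) and the example hostnames are hypothetical and are not taken from SAM or the GOCDB.

```python
def downtime_to_schedule(se_will_be_down):
    """Decide what downtime to schedule for a site.

    se_will_be_down: dict mapping each SE hostname at the site to True if
    that SE will be down. Returns a (scope, affected_ses) tuple where scope
    is "site", "se" or "none". Illustrative only.
    """
    if not se_will_be_down:
        raise ValueError("the site must list at least one SE")

    down = sorted(name for name, is_down in se_will_be_down.items() if is_down)

    if not down:
        return ("none", [])   # no SE affected, nothing to schedule
    if len(down) == len(se_will_be_down):
        return ("site", down)  # no SE left available: whole site is down
    return ("se", down)        # some SEs remain: downtime only for those SEs
```

For example, `downtime_to_schedule({"se1.example.org": True, "se2.example.org": False})` returns `("se", ["se1.example.org"])`, while a site whose only SE is down gets `"site"`.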

Progress on 2007-05-07: There is a long-running e-mail discussion on this.

Progress on 2007-05-07: no update

Progress on 2007-05-21: In progress.

COD + SAM

In progress

19/03/07

33

CERN ROC, FNAL

1. We set up a 2nd lcg gateway for redundancy. But if either goes down, SAM flags us as being down, thereby defeating the purpose of the 2nd gateway. Of course we are still operational, only SAM is marking us incorrectly. How can this be improved? I was told CERN runs multiple gateways; how do they handle this?

2. We need to split the cmswnNNN accounts on the 2 gateways since they operate independently.

Judit: SAM considers the gateways separately as two computing elements and they are both monitored.

The issue has to be followed off-line between SAM support and FNAL.

Progress on 2007-05-14: Judit is following up

Progress on 2007-05-21: SAM support could not find the problem reported and has explained to the originators that what they see is the expected behavior of SAM. Any additional comment from the originators, or can we close this action?

Update from FNAL: FNAL is checking again if they appear as completely down when only one CE is down. Test being done now, more news next week.

Progress on 2007-06-04: according to Joe Kaiser and Judit Novak, the issue is solved.

SAM

To be closed

14/05/07

4/06/07

34

SWE ROC

What is the status of the VO configuration tool in the CIC portal? Is it available?

Progress on 2007-05-14: work in progress, we’ll report in the coming weeks

Progress on 2007-06-04: update requested by mail to the CIC portal team:

The Oracle migration is complete. So are the small modifications to the way VO registrations are handled in the YAIM VO Configurator tool (the use of the CIC VO table and the difference in the way "official" and "non-official" registrations are handled). All that is needed now is some more testing before I commit the code to the IN2P3 CVS repository.

Hélène Cordier

In progress

11/06/07

36

SWE ROC

Items for COD meeting agenda in Stockholm: How to deal with failing PPS sites.

Progress on 2007-06-04: It is now included in COD’s agenda. Waiting for the report and conclusion.

OCC

New

04/06/07

37

ROC France

YAIM doesn’t support multiple clusters/sub-clusters per CE. A bug needs to be submitted for this.

Progress on 2007-06-04: https://savannah.cern.ch/bugs/?26757

To be closed.

Steve Traylen

To Be Closed

04/06/07

38

ROC UKI

UKI-SCOTGRID-GLASGOW had to clear jobs which were stalled due to lcg-cr commands hanging

(http://scotgrid.blogspot.com/2007/05/users-and-stalled-jobs.html). No response from biomed user who was responsible for most of these (https://gus.fzk.de/pages/ticket_details.php?ticket=22717). User will be banned from our site if no response is forthcoming.

We believe this is a reasonable policy for our site, but are there official guidelines on this?

Steve has contacted Romain to see whether any policy already exists in this area.

Romain forwarded a proposal to the OSCT; it is under discussion.

Steve/Romain

New

18/06/07


Closed Action Items

#

Raised by:

Description

Assigned to:

Due date

Date closed

71

 

ATLAS to check if they know of any conflicts between SL kernel version 2.6 and either the application software or the middleware.

Progress on 2006-12-11: ATLAS was not present at the meeting; we’ll check offline.

Progress on 2006-12-18:

Alessandro DeSalvo says that, as far as the ATLAS application is concerned, there is no problem. The only issue might arise (though this is not at all certain) from the Oracle client in the production system (which in any case has only 4 instances in the whole Grid) and from the Data Management clients in the VOBOXes. So as long as this discussion does not refer to VOBOXes, this is OK. The VOBOXes (only 10 nodes for ATLAS, one at each T1) will need to be considered some time soon. I will get in touch with Miguel about this.

The action can be closed.

Simone

11/12/06

18/12/06

76

 

VOs to update their mailing lists of grid users so that grid operational messages are communicated to all users when necessary.

Progress on 2007-01-08: The affected VOs have been contacted and the mailing lists updated. The action can be closed.

OCC / VOs

15/01/07

15/01/07

75

 

Provide DPM to ATLAS for testing purposes in PPS service.

Progress on 2007-01-08: 2 DPM nodes are now installed at the CERN PPS site. In progress.

Progress on 2007-01-15: This is now done. ATLAS will use lxb2058 which is a DPM at the CERN_PPS site.

Nick

15/01/07

15/01/07

3

 

Maite to forward e-mail regarding solutions to APEL problems at IFAE, to the SWE ROC.

Progress on 2007-01-18: This problem is now fixed. The action can be closed.

Maite

29/01/07

29/01/07

2

 

SA3 to set up a wiki page to give guidance on renewing host certificates for the different grid services. OCC to circulate the wiki page and ask for feedback on its suitability.

Progress on 2007-01-15: The wiki pages can be found here:

https://uimon.cern.ch/twiki/bin/view/LCG/TheLCGTroubleshootingGuide#How_to_replace_host_certificates

Progress on 2007-01-29: There has been no negative feedback so this item will be closed.

OCC

29/01/07

29/01/07

4

 

Publish links to all SRM monitoring.

Progress on 2007-01-18: The link to the SAM SRM monitoring is:

https://lcg-sam.cern.ch:8443/sam/sam.py?sensors=SRM&regions=CERN&regions=France&regions=UK_Ireland&regions=GermanySwitzerland&regions=Italy&regions=CentralEurope&regions=NorthernEurope&regions=SouthEasternEurope&regions=SouthWesternEurope&regions=Russia&regions=AsiaPacific&regions=IU_iuatlas&regions=USCMS&regions=USATLAS&vo=ops&order=RegionName&funct=ShowSensorTests


The monitoring used by the FTS experts is a prototype and can be found here: http://pcitgm02.cern.ch:8081/

Progress on 2007-01-29: Completed.  This item can be closed.

OCC

29/01/07

29/01/07

67

 

Timescale for move to Torque2?

Progress on 2006-10-30: In progress.

Progress on 2006-11-01: Expected to be in certification within 2-3 weeks.

Progress on 2006-11-20: The counter was restarted last Friday, so it will go to PPS in 2-3 weeks from now.

Estimated timeline: in PPS by ~15th Dec

Progress on 2006-12-18: it was released to PPS on Monday, and removed on Tuesday due to a critical problem found.

Progress on 2007-01-15: This is now being certified at CERN and by one SA3 partner in Greece.

Progress on 2007-01-29: Still in certification.

Progress on 2007-02-12: Waiting for the certification report from the SA3 Greek partner.

Progress on 2007-02-19: The patch containing this should be in the PPS next week.

Progress on 2007-02-26: The patch is included in gLite 3.0 PPS-update 20, which is being deployed. The action can be closed.

OCC

29/01/07

26/02/07

1

 

Document VO expiration procedure and associated error message when it happens.

Progress on 2007-01-15: This will be put into the GOC wiki.

Progress on 2007-01-29: In progress.

Progress on 2007-02-12: Maite will update it this week once she gets access to the GOC wiki

Progress on 2007-02-19: In progress.

Progress on 2007-02-26: Gocwiki updated. This action can be closed.

OCC

22/01/07

26/02/07

6

 

Item for COD agenda:  Running the RB SAM tests on demand. Should this be a request to the test developers or should the CODs be in the OPS VO?

Progress on 2007-02-12: Report about the conclusion once the COD minutes are published

Progress on 2007-02-19: The minutes of the COD meeting are not yet published.

Progress on 2007-02-26: The minutes of the COD meeting are not yet published.

Progress on 2007-03-05: This has been solved by finding a second RB to send SAM tests. The action can be closed.

Helene

12/02/07

05/03/07

05/03/07

7

 

Item for COD agenda:  Filtering out of SAM site test failures due to the failure of a grid central service.

Progress on 2007-02-12: Report about the conclusion once the COD minutes are published.

Progress on 2007-02-19: The minutes of the COD meeting are not yet published.

Progress on 2007-02-26: The minutes of the COD meeting are not yet published.

Progress on 2007-03-05: A 3-point strategy was defined at the last COD meeting. They are looking for people to implement it:

http://goc.grid.sinica.edu.tw/gocwiki/Tools_Improvements_for_COD#head-619c8018ab3d7bb773674324b16d97e51bc82b98

However, the timeline and details need to be refined by the SAM team, as much of the work is to be done in SAM and they are closest to it.

The action can be closed.

Helene

12/02/07

26/02/07

05/03/07

9

 

Request from the DM developers, who are testing SRM 2.2: gfal and FTS need to know the version of SRM they are talking to. A test will be added to gstat by the end of this week.

Progress on 2007-03-05: This should go in as a WARN this week, to be raised to ERROR in a couple of weeks.

This action can be closed.

Gstat team

12/03/07

05/03/07

10

 

Escalate ticket 18279 (WMS condorc-launcher files filling /tmp), raised by the DECH ROC, to the EMT.

Progress on 2007-03-05: Tmpwatch can be configured to clean those files up more often, even once a day, if needed.
The location and verbosity of those files was made configurable as of Condor version 6.8.4
To the best of my understanding this Condor version is being tested for further distribution, but this issue is closed as far as development goes.
The Condor version is 6.8.4:

 https://savannah.cern.ch/patch/?1062

The configuration is being worked on now, so it will be made available in around a month.

The action can be closed.

OCC

12/03/07

12/03/07

11

 

Point for next meeting agenda: information about WMS – RB deployment strategy

Progress on 2007-03-05: Ian Bird will attend next meeting to give a status update on this.

The action can be closed.

OCC

12/03/07

12/03/07

70

 

Conclude on "Policy for security updates of third party software".

The gLite integration team policy is: the external packages are not guaranteed to be maintained. They are provided for convenience. They are maintained by their providers.

The reality is that they will be maintained on best effort.

To be clarified with the security team.

Progress on 2006-11-27: being discussed with OSCT and SA3

Progress on 2006-12-04: This item was discussed during the meeting.  Waiting for SA3 to create the final list of external packages which need to be maintained.

Progress on 2007-01-15: Waiting for feedback from SA3.

Progress on 2007-01-29: Waiting for feedback from SA3. There is a proposal to focus on removing all unnecessary dependencies within the gLite code which will probably impact this item, so the deadline will be extended.

Progress on 2007-02-12: From Oliver: The progress to date is here;

https://twiki.cern.ch/twiki/bin/view/EGEE/SourceTarballs

In other words, some sorting of the externals; identification of what is maintained or out-of-date, what can be reclassified etc.

There is a plan being drawn up for a big effort on reducing the external dependencies. If I remember correctly, this plan is due next week. I think it's best to wait for that. The work done so far will be fed into this plan.

Progress on 2007-02-19: No update.

Progress on 2007-03-05: (Update offline after the meeting)

A gLite restructuring plan has been worked out by the integration, middleware and operations teams to make a radical examination of the code base with a view to removing unnecessary dependencies and cleaning up sections of the code that cause build and porting difficulties.

The following tasks have to be well advanced before the execution of the plan can be started. The EGEE PMB monitors the progress every second week.

- Move to SL4 on worker nodes and user interface

- Move to ETICS build infrastructure

- Stabilization and scalability of WMS and LB

- Stabilization and scalability of the gLite-CE

Even if the plan execution has not started, the developers have already started cleaning the dependencies while porting to the new build system.

I would propose to close this action as it is being tracked somewhere else, and come back with information to the operations meeting once it is ongoing.

OCC

15/01/07

11/02/07

12/03/07

8

 

From COD report: RB-time_to_match test must be improved. Many alarms linked with this test were useless. To be fixed by UKI, Steve will open a GGUS ticket and assign it to them

Progress on 2007-03-05: ticket 19319 opened

https://gus.fzk.de/ws/ticket_info.php?ticket=19319&from=ID

Progress on 2007-03-12: In progress.

Progress on 2007-03-19: This is now fixed and the ticket can be closed.

UKI ROC

12/03/07

19/03/07

15

 

OCC to chase up a solution for ticket 19464 (https://gus.fzk.de/pages/ticket_details.php?ticket=19464)

Progress on 2007-03-19: This ticket is now being handled by the gLite integration team.

OCC

26/03/07

19/03/07

16

 

LHCb to produce a list of services that will be affected due to the new SRM v1 endpoint at INFN-T1.  Also, a coordinator for the intervention must be found.

Progress on 2007-03-19: The list of services has been provided.  The intervention coordinator is Marianne Bargiotti.

LHCb

26/03/07

19/03/07

17

 

LHCb wish to use dccp across the WAN to do some pre-staging.  LHCb to provide the relevant tier 1 sites with the details of what is required (port numbers, port types, etc.)

Progress on 2007-03-19: LHCb has directly contacted the sites involved (RAL, GridKA and IN2P3).  RAL is in the process of opening the relevant ports.  GridKA and IN2P3 have yet to confirm their position.

LHCb

26/03/07

19/03/07

13

-

Some sites are accidentally publishing a local LFC as a global LFC. How might this be prevented?

Progress on 2007-03-19: Steve spoke to Judit who thinks that a solution can be implemented in the FCR tool. She will check.
The requirement for the FCR can be found here: https://savannah.cern.ch/bugs/index.php?24812

Progress on 2007-03-26: no update at the meeting

Progress on 2007-04-02: Judit has analysed the feasibility of providing this functionality using the FCR and it can be done.  It will go into the work plan.

Judit Novak

2/04/07

2/04/07

18

-

A middleware tool for carrying out the bulk removal of files from SEs, appropriately updating the catalogues, etc. is needed. This will be put onto the List of Issues maintained by the ROCs and regularly presented to the TCG.

Progress on 2007-04-02: This is now on the ROC Top Issues list, item 23 (https://twiki.cern.ch/twiki/bin/view/EGEE/SA1_TCG)

OCC

02/04/07

02/04/07

19

-

Request for mechanism so top level BDIIs can publish themselves

Progress on 2007-03-26: bug opened by Steve Traylen, to be raised at the EMT:

https://savannah.cern.ch/bugs/?25033

Progress on 2007-04-02: This work has been scheduled to be done immediately after the release of gLite 3.1.

OCC

02/04/07

02/04/07

21

-

Development/testing/certification/PPS status of MySQL LFC

Progress on 2007-03-26: PPS: several local instances, plus the global LFC at PIC run MySQL.

Certification: Both flavours of LFC are tested and have permanent installations on the testbed. To be closed.

OCC

02/04/07

02/04/07

22

-

Estimation on when SLC4 WNs will be available in production

Progress on 2007-04-02: The SL4 natively compiled WN is now in the PPS.  It will be tested by the HEP VOs and when they are happy with it, the WN will be passed to production. Action to be closed.

OCC

02/04/07

02/04/07

23

-

Estimation on when the unified version of RFIO client for DPM and castor will be in production

Progress on 2007-03-26: update from the developers: no manpower, this is not expected this year.

OCC

02/04/07

02/04/07

15

-

SA3 need to know when the larger sites plan to move to SL4 (and binary compatible) machines. This will allow better planning of the move of the middleware from SL3 to SL4. This information should be sent to Maite.Barroso.Lopez@cern.ch and Nicholas.Thackray@cern.ch

Progress on 2007-03-26: The following ROCs/sites have supplied information:

SW Europe ROC/PIC, DECH/FZK, SARA, SE Europe ROC, NIKHEF.  Please can the ROCs gather data from all remaining large sites.

Progress on 2007-04-02: In progress.

Progress on 2007-04-16: This can be closed: all major sites have given feedback, which has been put together on the following wiki page:

https://twiki.cern.ch/twiki/bin/view/EGEE/Sites_Plans_to_go_to_SLC4

All EGEE ROCs

16/04/07

26/03/07

24

ROC DECH

Ask the R-GMA development team if they can attend the next grid operations meeting to answer questions on instabilities seen in the R-GMA system.

Progress on 2007-04-16: Done, the action can be closed

OCC

16/04/07

16/04/07

25

All

Ask SA3 to give an update at the next grid operations meeting on the status of the port to SL4 and also the relative priorities of the different middleware services.

Progress on 2007-04-16: Done; this can be closed.

OCC

16/04/07

16/04/07

28

OSG

Check that Laurence Field is the correct person for OSG to contact regarding their problems with the BDII.

Progress on 2007-04-16: Yes, it is Laurence; please contact him. The action can be closed.

OCC

16/04/07

16/04/07

29

CERN/ Triumf/ UK

Escalate the problems being seen with the job wrapper tests to the ROC managers’ meeting.

Progress on 2007-04-16: This was done, Piotr presented it at today’s meeting (see minutes). The action can be closed.

Nick

16/04/07

3/04/07

14

-

Site HPC2N have the SE ibelieve-i.hpc2n.umu.se that they want to take out of production.  Simone to look at the data on the SE and decide if it needs to be moved or can be thrown away.

Progress on 2007-03-19: The site has provided ATLAS with a list of files which exist on the SE. ATLAS will go through this and decide what to do with the files. Per requests that this be carried out as quickly as possible.

Progress on 2007-03-26: No progress due to ATLAS’s problems with CASTOR and the ATLAS week in Germany this week.

Progress on 2007-04-02: No-one at the meeting knew of any update. Per will follow it up.

Progress on 2007-04-16: Nobody from ATLAS was online; to be chased offline.

Progress on 2007-04-23: As there was no response from ATLAS, the files were removed. Contact the site admin if more info is required: ake.sandgren@hpc2n.umu.se

This action can be closed.

Simone C. / Per Oster

23/04/07

26/03/07

34

SEE ROC

gLite 21 update release notes stated that reconfiguration is needed just for lcg-CE, lcg-CE_torque, and glite-CE, but in fact you need to introduce new accounts on all WNs at the same time. This is the list of GGUS tickets we have created so far:
https://gus.fzk.de/pages/ticket_details.php?ticket=20941
https://gus.fzk.de/pages/ticket_details.php?ticket=20942
https://gus.fzk.de/pages/ticket_details.php?ticket=21044
To conclude, I would say that the release notes yet again failed to mention some important things, and that this can badly affect VOs that make heavy use of prd or sgm accounts.

Progress on 2007-05-07: 2 tickets are now solved and one unsolved (Savannah bug opened), so the action can be closed.

OCC - YAIM

07/05/07

07/05/07

26

ROC SEE

Check up on the status of the following tickets:

https://gus.fzk.de/pages/ticket_details.php?ticket=18689
https://gus.fzk.de/pages/ticket_details.php?ticket=18353

Progress on 2007-04-16:

18353: not solved but some activity, ongoing.

18698: no progress for a long time, no answer. To be raised again.

Progress on 2007-04-23: These tickets were raised to the EMT on 25/04/07. Some action will follow and will be reflected in the tickets.

Progress on 2007-05-07: These tickets are now being handled.

OCC

07/05/07

16/04/07

31

COD (Russia)

Solve first issue in the list of COD notes for today’s meeting: http://egee-docs.web.cern.ch/egee-docs/operational_tools/Operations_Meetings/2007/Weekly_Operations_Meeting_minutes_2007-04-23.htm

Error while calling the "NSClient::multi" native api IOException: Unable to connect to remote (lapp-rb01.in2p3.fr:7772)

Progress on 2007-05-07: This is not an issue for the PPS to solve, but for the site to solve. The site should ask for help from their ROC and if the ROC cannot help then a ticket should be raised. Close.

PPS coordination

07/05/07

07/05/07

20

-

Find or request documentation on how a VOMS proxy is mapped on a grid node (CE, SE, etc.) using LCMAPS.

Maria Dimou provided pointers to docs from 2005. Waiting for input from Maarten and David Groep.

Progress on 2007-04-02: In progress.

Progress on 2007-04-16: The available information was collected and sent to Pierre. It is not enough, he needs the rules and they are not available. Suggestion to open a ticket to the developers.

Progress on 2007-04-23: Pierre forgot to open the ticket, he will do it this week so this issue can be assigned and solved.

Progress on 2007-05-07: a GGUS ticket was submitted (GGUS #21349) to ask developers for a complete LCMAPS administrator guide.  The ticket has been raised to the EMT.

Progress on 2007-05-14: Comment from the EMT: we will work on it as soon as possible. This action can be closed.

OCC

16/04/07

14/05/07

21/05/07

27

OCC

Report on the outcome of the tests regarding grid site status information (taken from the information system, the WMS, the batch systems, etc.).

Progress on 2007-04-16: Patricia reported on it during the meeting. We’ll leave it open until we have the outcome.

Progress on 2007-04-23: Data from the grid site tests were made available in MonAlisa. Alice is now checking that the issues reported are actually consistent, which is the reason for the detailed questions about the queues put to GRIDKA during this meeting.

Progress on 2007-05-07: In progress. Alice to report at the next meeting.

Progress on 2007-05-14: Patricia reported about this; see her report in the Alice section of the minutes. This action can be closed.

Alice

14/05/07

21/05/07

30

ROC SEE

Create a wiki page to collect information about deployment of SL4 gLite services, even with workarounds:

http://wiki.egee-see.org/index.php/SL4_WN

Progress on 2007-04-23: not done yet. Does the SEE ROC volunteer to do it under the main SA1 wiki or at the GOC wiki? Kostas: no manpower. Volunteers are welcome.

Progress on 2007-05-14: there are no volunteers, not many people interested, so ROC SEE agrees to close the action

OCC

23/04/07

21/05/07

32

CERN ROC, TRIUMF

SAM still handles timezones incorrectly. Maintenance on Fri 20th was scheduled for 14:00 - 16:00 UTC, but SAM shows maintenance incorrectly at 08:04 UTC and shows us in error at 14:02 UTC, i.e. wrongly during our maintenance.

Judit: the timestamp is taken by SAM directly from the GOCDB as it is registered. GOCDB seems not to correctly convert time zones. GOC DB people will look into the problem together with Judit who will send all the details already found.

Progress on 2007-05-07: This now looks like it might be a SAM issue. Still under investigation.

Progress on 2007-05-14: Judit is following up

Progress on 2007-05-21: Bug opened in Savannah to fix this: https://savannah.cern.ch/bugs/index.php?26500.  As this issue will be tracked in Savannah, this action item will be closed.

SAM, GOCDB

04/06/07

21/05/07

35

UKI ROC

Technical issues to do with the email that CIC-Portal Alarms send:
a) The From field should be CIC-Portal@in2p3.fr and not just CIC-Portal. Otherwise intervening mail relays add their own spurious @host info and so the mail can be misidentified by mail browsers.
b) All emails from CIC-Portal, and in2p3.fr generally, are given a Spam-Assassin rating of DNS_FROM_RFC_ABUSE 0.37, plus whatever other spam score the contents of the message might incur. This would be avoided if in2p3.fr got itself de-listed from www.rfc-ignorant.com - that shouldn't be hard!

Progress on 2007-05-21: Both points have been addressed.  Close this item.

CIC team

04/06/07

04/06/07