
WLCG-OSG-EGEE Operations meeting

Europe/Zurich
28-R-15 (CERN conferencing service; joining details below)

Maite Barroso Lopez (CERN)
Description
grid-operations-meeting@cern.ch
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure based on weekly reports from the attendees. Reported issues are discussed, assigned to the relevant teams, followed up, and escalated when needed. The meeting is also the forum for sites to get a summary of the weekly WLCG activities and plans.
Attendees:
  • OSG operations team
  • EGEE operations team
  • EGEE ROC managers
  • WLCG coordination representatives
  • WLCG Tier-1 representatives
  • other site representatives (optional)
  • GGUS representatives
  • VO representatives
  To dial in to the conference:
    a. Dial +41227676000
    b. Enter access code 0157610


    NB: Reports were not received in advance of the meeting from:

  • ROCs:
  • Tier-1 sites: ASGC, FNAL, IN2P3, INFN
  • Tier-1 availability reports:
  • VOs: ALICE
  • list of actions
    Minutes
      • 16:00 - 16:05
        Feedback on last meeting's minutes 5m
        Minutes
      • 16:01 - 16:30
        EGEE Items 29m
        • Grid-Operator-on-Duty handover
          From ROC Taiwan (backup: ROC UKI) to ROC Italy (backup: ROC CERN)

          NB: Could the grid ops-on-duty teams please submit their reports no later than 12:00 UTC (14:00 Swiss local time).

          Tickets (lead team):
          • Opened (new): 30
          • Closed: 48
          • 2nd mails: 8
          • Extend: 25
          • Quarantine: 10

          Issues:
            Several sites still show no new SAM results on the Alarm page.

            xxx

            Backup team:
            xxx
        • PPS Report & Issues
          PPS reports were not received from these ROCs: AP, IT, NE

          Release:
        • gLite 3.0.1 PPS Update 33 was released to PPS today.
          It contains the new version of LFC/DPM (1.6.5-1) as a high-priority patch.
        • Last week the gLite 3.1.0 UI was released to PPS and is now available for PPS users.
          The configuration of the relocatable version (tar UI) was restructured in a way that affects users (see details among the issues in the CERN ROC report).
        • Patches #1179 (LFC/DPM), #1174 (BDII indexing), and #1200 (yaim 3.0.1-22) will be fast-tracked to production this Wednesday.

          Operations:
        • Cyfronet has started operating its new SAM client, sending tests to PPS.
          Currently test jobs are submitted every hour, alternately from Cyfronet and CERN.

          Issues from EGEE ROCs:
          1. (ROC DECH): Still awaiting more tests for the new SRM 2.2 instance. No further update on the PPS status.

          2. Reply from PPS coordination: a summary of the SRM 2.2 tests completed in PPS is currently in preparation by the GSSD Working Group.
          3. (ROC CERN): Report from CERN_PPS
            During the installation of the gLite 3.1.0 PPS-U01 (tar UI) we noticed that:
            1. The directory $INSTALL_ROOT/etc was removed and its content moved to $INSTALL_ROOT/external/etc. As a major consequence for users, the location of the environment script has changed.
            2. The name of the environment script was changed from grid_env.(c)sh to grid-env.(c)sh.
            In our opinion, both changes are likely to surprise and inconvenience the user community, so we opened two critical bugs to track these issues (a sketch of the user-visible change follows the bug links):
            • https://savannah.cern.ch/bugs/index.php?27361
            • https://savannah.cern.ch/bugs/index.php?27362
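            The user-visible effect of the two changes, as a minimal shell sketch (the exact sourcing line in any given user profile is illustrative; the paths and names are those from the report above):

                # gLite 3.0 tar UI: the environment script lived under $INSTALL_ROOT/etc
                source $INSTALL_ROOT/etc/grid_env.sh
                # gLite 3.1.0 PPS-U01: both the directory and the script name changed
                source $INSTALL_ROOT/external/etc/grid-env.sh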


        Speaker: Antonio Retico (CERN)
  • next versions of YAIM: content and timelines 10m
    YAIM 3.1.1 changes with respect to 3.0.x:
    • One YAIM that will configure both gLite 3.0 and gLite 3.1 services (get rid of yaim 3.0.1 and yaim 3.1.0)
    • Main feature: modularity (yaim core + yaim clients + one yaim module per service)
    • Won't support gLite CE configuration and 3.0 WMS configuration (this will be done by yaim 3.0.1)
    • Doesn't contain substantial new code (except for 3.1 WMS configuration); it is a merge of the yaim 3.0.1-22 and yaim 3.1.0 (UI and WN) configurations
    • Timeline: 25th to 29th June -> certification; 2nd July -> ready for PPS
    YAIM 3.0.1-22 content:
    • Most remarkable:
      - VOViews for FQANs are not configured
      - Security bug fix with the YAIMLOG file
      - Configuration of static mapping of special accounts like sgm/prd
      - Bug fixes from previous obsoleted yaim versions: 17/18/19/21
    • Timeline: 25th June -> ready for PPS
    More details about the content and timelines of the releases are on the YAIM Planning page: https://twiki.cern.ch/twiki/bin/view/LCG/YaimPlanning (a configuration run under the new layout is sketched below).
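    A minimal sketch of a configuration run under the modular layout described above, assuming the standard YAIM entry point and paths of the 3.1 series (the node type WN and the site-info.def location are illustrative; exact module packaging for 3.1.1 is an assumption based on the plan):

        # Configure (-c) a worker node from the site-wide configuration file.
        # With modular YAIM, the same entry point drives yaim core plus the
        # per-service module matching the requested node type (-n).
        /opt/glite/yaim/bin/yaim -c -s /opt/glite/yaim/etc/site-info.def -n WN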
    Speaker: Maria Alandes Pradillo (Unknown)
  • EGEE issues coming from ROC reports
    • ROC Central Europe:
    • For information: SAM replica instance. Preparations for performance tests are ongoing; Cacti is being installed to gather and plot machine/network performance metrics. The plan is to start sending SAM test jobs every 12 h on 27.08.2007, then increase the submission frequency up to a 1 h period.
    • For information: prepared the regional Nagios monitoring system for the migration from GOCDB2 to GOCDB3. Also prepared an uncertified-sites GIISURL list generator for the migration from GOCDB2 to GOCDB3; the list is used by the RB/WMS which submits to uncertified sites.
    • ROC France:
    • We have noticed an increasing number of problems with the SAM framework: changes in the GOC DB (node description updates, scheduled downtimes) were wrongly taken into account by SAM, some nodes were still tested even though they were declared in a scheduled downtime, etc. We are expecting some clarifications regarding the latest changes to the SAM framework, to better understand how it is supposed to work now, whether there are still problems with the latest release, etc. It is important to communicate any SAM changes, or any temporary SAM problem, to the sites. It would be damaging for production if sites switched off the monitoring because they do not understand SAM's behaviour and then no longer trust it. Moreover, we do not think that SAM developers should continue to recommend that sites switch off the monitoring of their nodes during a scheduled downtime: if a site administrator forgets to switch the monitoring back on at the end of the downtime, some malfunctioning nodes might be in production without the watchdog that SAM+FCR has provided up to now.
    • South Eastern Europe ROC:
    • LCG-IL-OU asks for a 7-day history of SAM tests, in order to be able to comment on errors in the weekly reports.
    • UKI ROC:
    • Two UK sites have encountered SRM (dCache) problems. The upgraded dCache version (1.7.0-36) has Java memory issues, of which the developers have been made aware.
    • Has any site implemented multiple sgm accounts on the Condor batch system? In the UK we currently have one site running a Condor batch system, and until now it has not needed sgm accounts (software installation is taken care of locally).
  • 16:30 - 17:00
    WLCG Items 30m
    • Tier 1 reports
      more information
    • UPDATE: job priorities and YAIM
      Very short term: the FQAN VOViews should disappear from the information system. The VO:atlas view will then show inclusive information for ATLAS jobs submitted with any role. This means the FQAN VOViews should no longer come with the default YAIM configuration (action for SA3) *and* they should disappear from the sites which have already deployed them, both via YAIM and by hand (action for SA1).
      UPDATE: T1s were requested via broadcast and GGUS tickets to remove the configuration manually; status below, with a verification query sketched after the list.
    • ASGC: done
    • INFN: ticket not modified
    • PIC: waiting for YAIM
    • IN2P3: in progress
    • GridKA: ticket not modified
    • RAL: done
    • TRIUMF: ticket not modified
    • SARA: ticket not modified
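    A hedged sketch of how a site could verify that it no longer publishes FQAN VOViews (the top-level BDII endpoint is the usual CERN one, and the 'Role=' pattern is an assumption about how FQAN-based view identifiers are named; adjust both as needed):

        # List the VOView identifiers published for the site's CEs; FQAN-based
        # views typically carry a VOMS role in their local ID, so any match
        # below means the configuration is still in place.
        ldapsearch -x -H ldap://lcg-bdii.cern.ch:2170 -b o=grid \
          '(objectClass=GlueVOView)' GlueVOViewLocalID | grep -i 'Role='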
    Speaker: Simone Campana (CERN/IT/GD)
  • WLCG issues coming from ROC reports
    1. UKI ROC: There has been a recent change to the ALICE policy on job submission: basically they have moved from using multiple RBs at a site simultaneously to using one until it fails and then switching to the next. This appears to have been part of the reason one of our RBs was inaccessible last week. We'd like to know why this policy change was made.


    2. SWE ROC: At PIC we upgraded the SRM-disk service on 21-22 June. We declared *only* that service in Scheduled Downtime in the GOCDB. However, we have received some complaints from users (CMS) because, since the CE service was not closed, their jobs were still entering PIC and trying to contact the SRM-disk service, and of course failing. They tell us we should close all the services (including the CE) if we have an intervention on one of them. Is this so? We believe it makes more sense for users to check the availability of all the services they consider critical at a centre, and not to assume that if a CE is up, all other services are up.

      ANSWER from S. Traylen:
      Need to know more information about what CMS is doing and how they locate the SE in question, also is it for reading or writing a file.
      So if PIC had done the following:
      * Stop publishing the GlueCESEBind from the CE. This removes the association between their CE and SE.
      and CMS had done
      * Matchmake their files with RB matchmaking; then the RB/WMS would not have matched against this CE in the first place.
      I doubt that either of these things was done: the first because it is currently hard (no tools are in place to do so), the second because hardly any users do this, though CMS would have to confirm.
      Things that can be practically done:
      0) Find out how CMS are locating the SE in question.
      1) Give sites a tool that allows them to mark services as offline (or critical in fact) See https://savannah.cern.ch/bugs/?func=detailitem&item_id=17777 for what the tool might do.
      2) Have as many clients as possible respect the ServiceStatus flag. lcg-utils would be a good place to start.
      3) A fudge could be done in FCR (it is a complete fudge after all) to have it delete GlueCESEBind values as well for SEs that are marked down. This could be done, but it still relies on CMS locating the SE in an information-system-aware way. In particular, the DEFAULT_SE_<VO> (or whatever it is called) defined at a site is not information-system aware. A query showing the CE-SE association in question is sketched below.
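      For reference, a hedged sketch of how to inspect the CE-SE association (GlueCESEBind) that the options above revolve around; the top-level BDII endpoint is the usual CERN one and may need adjusting:

          # Show which SEs each CE group advertises as close; the FCR fudge in
          # 3) would amount to deleting these values for SEs marked down.
          ldapsearch -x -H ldap://lcg-bdii.cern.ch:2170 -b o=grid \
            '(objectClass=GlueCESEBindGroup)' GlueCESEBindGroupSEUniqueID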

    REPLY FROM CMS (Stefano Belforte): CMS analysis and production jobs are submitted with the JDL requirement

        requirements = anyMatch(other.storage.CloseSEs, (target.GlueSEUniqueID == "srm.cern.ch"));

    [replace srm.cern.ch with the SE of the various sites]

    So what you indicate as 3) would work. I am not sure that stopping publishing the GlueCESEBind from the CE would be proper, since the association exists and is, e.g., maybe needed for FCR to know what to remove/add as SEs go on and off.

  • WLCG Service Interventions (with dates / times where known)
    Link to CIC Portal (broadcasts/news), scheduled downtimes (GOCDB) and CERN IT Status Board

    See also this weekly summary of past / upcoming interventions at WLCG Tier0 and Tier1 sites (extracted manually from EGEE broadcasts and other sources).

    Time at WLCG T0 and T1 sites.

    1. FZK/GridKa will be unreachable on 27/6 from 05:00 UTC to 18:00 UTC. Network connections will be restored after 18:00 UTC, but maintenance will continue until 28/6 18:00 UTC. Services are impacted during the whole period from 27/6 05:00 UTC until 28/6 18:00 UTC.
  • FTS service review
    Speaker: Gavin McCance (CERN)
  • FTS 2.0 service and client compatibility issues - reminder
    Following some questions last week, please see https://twiki.cern.ch/twiki/bin/view/LCG/FtsChangesFrom15To20

    In particular, "Client Compatibility" and "Upgrade Path" sections at the bottom.

    The relevant client release was made in October 2006.

  • ATLAS service
    Speaker: Kors Bos (CERN / NIKHEF)
  • CMS service
    • General: the last two weeks hosted the CMS Annual Review week and the June CMS week.
    • Job processing: Spring07 MC production activities last week focused on ongoing production of extra events, in addition to the 35M produced since spring. In addition, 21M HLT re-reco events were produced. MC production for CSA07 amounted to ~23M (~20M merged) MinBias GEN-SIM events produced in 4.5 days; that is a rate of ~150M (125M merged) evts/month, also thanks to more reliable sites.
    • Data transfers: the 'extended' LoadTest is converging into the CSA07 preparation activities, and an extended plan to continue debugging transfer links is being designed accordingly.
    Speaker: Mr Daniele Bonacorsi (CNAF-INFN BOLOGNA, ITALY)
  • LHCb service
    1. LHCb would like to remind all sites that *before* moving all their WN resources to SLC4 they should provide, for a limited time, another CE pointing to a restricted part of their farm running the implementation of SL(C)4 they will install in production. This will allow VOs to test the actual OS installation at sites before fully migrating.
    2. SL(C)4 WNs should be installed with 32-bit compatibility even on 64-bit nodes, as currently there is no 64-bit native middleware, although the VOs' applications are certified on 64-bit and can actually run in native mode when they do not make use of the middleware.
    3. Of course, the platform used by the WNs of the CEs (OS, architecture and possibly the compiler used for the client libraries) should be advertised in a consistent way according to a well-defined convention, and this should be enforced before the CE is put into production. The current definitions of the Glue schema information are still very vague and difficult to use for matching resources: many equivalent platforms have different advertisements, and the version of the compiler is not published (which is not yet relevant but will be in the near future, when applications might move to gcc 4.1 for example). We remind everyone that the LCG defined a convention a very long time ago for the applications (__) that is used by the LHC VOs, and matching capabilities should be provided based on this nomenclature. (A query showing what sites currently advertise is sketched below.)
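    A hedged sketch of how to see what platform information sites currently advertise (the top-level BDII endpoint is the usual CERN one; the attributes are from the Glue 1.x host schema):

        # Dump the OS name/release and platform type each subcluster publishes;
        # inconsistent values across equivalent platforms illustrate point 3.
        ldapsearch -x -H ldap://lcg-bdii.cern.ch:2170 -b o=grid \
          '(objectClass=GlueSubCluster)' \
          GlueHostOperatingSystemName GlueHostOperatingSystemRelease \
          GlueHostArchitecturePlatformType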
    Speaker: Dr Roberto Santinelli (CERN/IT/GD)
  • ALICE service
    Speaker: Dr Patricia Mendez Lorenzo (CERN IT/GD)
  • 16:55 - 17:00
    OSG Items 5m
    1. Item 1
  • 17:00 - 17:05
    Review of action items 5m
    list of actions
  • 17:10 - 17:15
    AOB 5m