28-R-15 (CERN conferencing service (joining details below))
Maite Barroso Lopez
firstname.lastname@example.org
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up, and escalated when needed. The meeting is also the forum for the sites to get a summary of the weekly WLCG activities and plans.
OSG operations team
EGEE operations team
EGEE ROC managers
WLCG coordination representatives
WLCG Tier-1 representatives
other site representatives (optional)
To dial in to the conference:
a. Dial +41227676000
b. Enter access code 0157610
NB: Reports were not received in advance of the meeting from:
ROCs: NorthernEurope, Russia
Tier-1 sites: in2p3, fnal, ndgf, triumf
VOs: Alice, Atlas, CMS, LHCb
List of actions
Feedback on last meeting's minutes (5m)
<big> Grid-Operator-on-Duty handover </big>
From ROC SWE (backup: ROC SEE) to ROC DECH (backup: ROC France)
NB: The grid ops-on-duty teams are asked to submit their reports no later than 12:00 UTC (14:00 Swiss local time).
Tickets: newly opened: 27; 2nd mails: 23; all together: 103.
Issues:
SWE: There were many alarms for nodes that were not yet registered in the GOCDB.
Ticket resolution has been effective; there were no outstanding problems.
SEE: Many nodes had monitoring switched off in the GOCDB. The possibility of opening a single ticket for a site (when many alarms arise simultaneously at one and the same site) reduces the workload for the COD team, but we have to be more careful to avoid opening new tickets for individual nodes of that site, because the alarms still appear on the alarm page.
<big> PPS Report & Issues </big>
PPS reports were not received from these ROCs:
AP, IT, NE, RU, SWE
New updates were announced to PPS last Thursday:
These updates include the new version of YAIM.
The next update to PPS will be released out of schedule, possibly within this week (depending on the results of certification). It will contain the latest version of the GFAL/DPM clients, fully compliant with the StoRM implementation of SRMv2.
Diligent started running the first phase of its data challenge this week.
The activity, involving all sites supporting Diligent, will be carried out according to the following schedule:
1st Part (2 weeks)
Start: Monday 16th July
End: Friday 27th July
2nd Part (2 weeks)
Start: Monday 20th August
End: Friday 31st August
Sites willing to start supporting Diligent and to be involved in the next phase can find more information on the
Diligent Data Challenge web page
Issues from EGEE ROCs:
<big> SL4 (32/64 bit) OS publishing</big>
-> This is very well deployed and consistent
-> Progress here, but this is not a default publication unless sites are
To give some figures on progress:
There are 352 GlueSubClusters, of which 30 publish a GlueHostArchitecturePlatformType value.
Proposal: end users assume that subclusters with an unpublished GlueHostArchitecturePlatformType are 32-bit.
If they find unpublished 64-bit sites, they should raise a ticket with the
ROC. New sites should be publishing correctly, so the problem is finite.
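The proposed end-user default could be sketched as follows. This is a minimal illustration only: the attribute names follow the Glue schema, but the sample subcluster data and the helper function are hypothetical, not part of any client tool.

```python
# Sketch of the proposed default: if a GlueSubCluster does not publish
# GlueHostArchitecturePlatformType, treat it as a 32-bit resource.
# Sample data and helper are hypothetical, for illustration only.

def effective_platform(subcluster):
    """Return the published platform type, defaulting to 32-bit ('i686')."""
    return subcluster.get("GlueHostArchitecturePlatformType", "i686")

subclusters = [
    {"GlueSubClusterName": "siteA", "GlueHostArchitecturePlatformType": "x86_64"},
    {"GlueSubClusterName": "siteB"},  # nothing published: assumed 32-bit
]

for sc in subclusters:
    print(sc["GlueSubClusterName"], effective_platform(sc))
```

Under this convention, a 64-bit site that forgets to publish the attribute would be scheduled as 32-bit, which is exactly the case the proposal asks users to report via a ROC ticket.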
<big>Contact site -> VOs </big> (5m)
The present situation is:
- for urgent issues/problems: GGUS; the ticket will be assigned to the VO support units
- for questions, non-urgent issues, etc.: the operations meeting; sites need to write these in the weekly CIC site (RC) reports, in the "Points to Raise at the Operations Meeting" text box
<big> EGEE issues coming from ROC reports </big>
NO ISSUES REPORTED THIS WEEK
<big> Tier 1 reports </big>
<big> WLCG issues coming from ROC reports </big>
None this week
<big>WLCG Service Interventions (with dates / times where known) </big>
DB interventions (as with all others) that do not follow the agreed (May 2006) WLCG procedure will be classified as *unscheduled*. They should be discussed with the sites through the well-established channel of the weekly joint operations meeting and properly broadcast.
<big>Preparations for LFC service at LHCb Tier1 sites</big> (5m)
Following the mail from Eva to email@example.com last week (see below), LHCb Tier1 sites (other than CNAF - already done) are requested to allocate the hardware for the LFC middle tier.
A single batch worker node per site is expected to be sufficient, for both load and availability, the latter being addressed by the use of a read-only (R/O) replica at another site in case of problems.
The target date for these LFC services entering production is the end of September (so as not to conflict with other pressing issues, such as SL4 WNs for CMS CSA07, FTS 2.0 services, SRM 2.2, etc.).
I have included the following Tier1 sites in the LFC Streams environment: IN2P3, GridKA, PIC and RAL.
The LFC data has been imported in the appropriate schemas at the Tier1 sites LHCb databases and the replication using Streams has been successfully enabled.
Tier1 database administrators and LFC team should now validate the copy and open the LFC service to production on their side.
Please let me know if you have any question.
<big>Upgrade to SL4 WN release</big> (5m)
<big>Preparations for WLCG Collaboration workshop - operations session</big> (5m)
As per the draft agenda for the operations session (see above), sites are requested to send their top 5 operations issues to Nick by August 10th so that these can be consolidated into a single list.
Suggestions for additional topics for this session should be sent by July 31st.
Suggestion from Gonzalo Merino:
Experiences from sites/experiments operating the FTS servers
Some months ago, a channel configuration for the FTS servers at the T1s was suggested (essentially, T1s host channels where they are the destination). This configuration seems not to fit the needs of CMS, for instance. It also seems problematic since sites have no control over the files being read from them, so if many sites start to request reads from a given T1, this could overwhelm the storage service. If this is the situation, we should make sure that the different SRM/Storage implementations provide sites with the tools to control these situations.
Job processing: CSA07 production status is in general steady: 39M evt/month rate on QCD/Photon Jets assignments; so far 24.5M + 24M (MinBias) PH soup produced; the next 50M will be assigned soon; the first 2 CSA07 DPG requests have started. Some sites did not join the MC production last week for different reasons (e.g. CNAF had 2 days of Castor upgrade - now done; ASGC still misses CMSSW 1_4_4 deployment - in progress).
Data Transfers: "production" transfers: GEN-SIM data shipping to T1s on-going; "test" transfers: LoadTest continues, the Debugging Data Transfers program has been launched, first report due next Thursday.
The various client tools (FTS, GFAL, lcg_utils) have been enhanced to support SRM v2.2. During certification and testing, some bugs have been found. More details on the schedule and the feature list will be provided at future operations meetings. See also today's LCG ECM.