28-R-15 (CERN conferencing service (joining details below))
Description
grid-operations-meeting@cern.ch Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure, based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum for the sites to get a summary of the weekly WLCG activities and plans.
Attendees:
OSG operations team
EGEE operations team
EGEE ROC managers
WLCG coordination representatives
WLCG Tier-1 representatives
other site representatives (optional)
GGUS representatives
VO representatives
To dial in to the conference:
a. Dial +41227676000
b. Enter access code 0140768
NB: Reports were not received in advance of the meeting from:
ROCs:
VOs: Alice, BioMed, CMS, LHCb
16:00
→
16:00
Feedback on last meeting's minutes
Minutes
16:01
→
16:30
EGEE Items (29m)
<big> Grid-Operator-on-Duty handover </big>
From: SWE / Italy
To: DECH / Russia
Report from Italy COD:
Site: ru-Moscow-GCRAS-LCG2,
GGUS #34045, #34051, #34817
Reached last escalation step, but then the site reacted with:
"Still problem with certificates, including users certs and RA."
The RA itself has certificate problems and is preparing the paperwork for renewal.
We allowed them to wait for this in a downtime state, because it is not a software problem to be corrected, but simply a wait for new certificates to be provided by the CA/RA.
Report from SWE COD:
Australia-UNIMELB-LCG2:
GGUS Ticket #34393
The site comments that their SE is full because the ATLAS VO is not removing files. Is this a problem for the ATLAS VO, or should the site reserve disk space for the OPS VO?
YerPHI:
GGUS Ticket #26634
The site has been escalated to the political instance, but is neither in Scheduled Downtime nor suspended.
What is the latest status on this?
<big> PPS Report & Issues </big>
PPS reports were not received from these ROCs:
AP, FR, IT, NE
Issues from EGEE ROCs:
CERN ROC: yaim-core 4.0.4, released with gLite 3.1.0 PPS Update 22, introduces a check that blocks the configuration if read permissions are given to non-root users on the site-info file or the directory where it is stored. This causes problems in set-ups where the permissions cannot be changed to 700 (e.g. installations of the UI on AFS). A bug has been opened for this (https://savannah.cern.ch/bugs/?35307), and the check will be softened in version 4.0.5. Sites installing version 4.0.4 should be prepared to change a function in yaim as described in https://twiki.cern.ch/twiki/bin/view/LCG/YaimGuide400#Known_issues
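yaim itself is implemented in bash; the following Python sketch only illustrates the logic of the new check (refuse to configure when the site-info file or its directory carries any group/other permission bits). The paths and function names here are made up for illustration and are not yaim's actual code:

```python
# Illustrative sketch of the yaim-core 4.0.4 check (not yaim's real code):
# configuration is refused if the site-info file or its directory is
# accessible to non-root users, i.e. has any group/other permission bits.
import os
import stat
import tempfile

def open_to_non_root(path):
    """True if group or other users have any permission bits on path."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return bool(mode & 0o077)

# Simulate a correctly locked-down set-up (directory 700, file 600).
site_dir = tempfile.mkdtemp()
site_info = os.path.join(site_dir, "site-info.def")
open(site_info, "w").close()
os.chmod(site_dir, 0o700)
os.chmod(site_info, 0o600)

blocked = open_to_non_root(site_dir) or open_to_non_root(site_info)
print("yaim would refuse configuration" if blocked else "permissions OK")
```

On an AFS-hosted UI, where the directory mode cannot be reduced to 700, such a check necessarily fires, which is exactly the problem the ROC reports.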
<big> gLite Release News</big>
Release News:
Now in production
gLite 3.1 Update 18 went to production last Monday.
The update contains:
NEW: glite-MON for SL4
DPM 1.6.7-4
fix for bug #33769: incorrect pool free space after dpm-drain
improved ACL management for srmMkdir command
UI/WN/VOBOX
lcg-tags no longer produces Globus warnings
voms-admin client 2.0.6-1 providing ACL support on command line
vdt_globus_essentials (affecting several services and notably the CE)
bug fix to prevent globus-job-manager processes from piling up on a CE
(bug observed at CERN after SAM WMS/RB tests were enabled)
voms-admin server (VOMS)
Refactored voms-admin-ping script
ACL management web service (compatible with client >= 2.0.6-1)
gLite 3.1.0 PPS Update 22 passed the pre-deployment tests
and it is now being installed by the PPS sites.
The release contains, among others, an update of yaim-core, so,
technically, all services are concerned.
The full list of patches deployed is:
glite-AMGA_oracle (initial release)
UI/WN/VOBOX
GFAL/lcg_util: many bug fixes
new lcg-ManageVOTag version (solving bug #34245)
lcg-infosites: new option to query the wms and lb associated to a VO.
-f option to filter based on the site name
[ YAIM ] glite-yaim-clients: bug fixes + configurable list of WMS and LB
R-GMA
Switch back to using MEMORY instead of DATABASE producer
YAIM (affecting all nodes)
new yaim-core with a consistent list of changes and bug fixes
CE
change to lcg-info-dynamic-scheduler to support DENY tags
2008-04-11(1): Task: gLite 3.1 Update 19 --> Production in preparation
The update will contain:
UI/WN/VOBOX
many bug fixes, including the one preventing the use of aliases for WMS
new lcg-ManageVOTag version
MON
R-GMA fix for forwards compatibility - 3.1.0 PPS Update 22
Many services
lcg-vomscerts-4.9.0 adds next cert for lcg-voms
<big> EGEE issues coming from ROC reports </big>
(ROC CE): Majority of CE sites failed SAM due to wrongly advertised LFC for OPS VO. https://gus.fzk.de/pages/ticket_details.php?ticket=35093
It is a weak point of the infrastructure that a site can publish anything and make all sites fail OPS tests. Are there any plans to change it?
(ROC France): OPS test was using lfc-lhcb.grid.sara.nl as LFC server for OPS.
This shows the information service cannot be trusted; it is a point of failure that allows anyone to deny service to others.
Please, would it be possible to consider a grid where nobody could break the grid simply by publishing something wrong?
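One defensive approach to the concern raised above would be for the OPS tests to cross-check any LFC endpoint found in the information system against a VO-maintained allowlist before trusting it. This is only an illustrative sketch, not an existing SAM feature; the allowlisted hostname and helper function are hypothetical (only the SARA endpoint comes from the report above):

```python
# Illustrative sketch only (not an existing SAM/OPS feature): filter LFC
# endpoints published in the information system through a VO-maintained
# allowlist, so a single mis-publishing site cannot fail everyone's OPS tests.
# "lfc-ops.example.cern.ch" is a hypothetical OPS endpoint; the SARA host is
# the one wrongly picked up in the incident reported by ROC France.

OPS_LFC_ALLOWLIST = {"lfc-ops.example.cern.ch"}

def trusted_lfc_endpoints(published):
    """Keep only the published endpoints that the OPS VO actually recognises."""
    return [host for host in published if host in OPS_LFC_ALLOWLIST]

published = ["lfc-ops.example.cern.ch", "lfc-lhcb.grid.sara.nl"]
print(trusted_lfc_endpoints(published))  # the wrongly published host is dropped
```

The design point is simply that the information system is treated as a hint rather than an authority: anything it publishes is validated against data the VO controls before it can influence test results.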
16:30
→
17:00
WLCG Items (30m)
<big> WLCG issues coming from ROC reports </big>
None this week.
<big>WLCG Service Interventions (with dates / times where known) </big>
[INFO] FZK Downtime: Due to the LFC DB migration from MySQL to Oracle, GridKa/FZK's LFC service will be down on Friday 18/04/2008 from 5:30 UTC to 20:00 UTC (the LHCb LFC will not be affected by this).
DB downtime at CERN-PROD taking down FTS, SAM, GridView, VOMS and LFC, Thursday April 17th 2008. All the details
Last week functional test was quite good.
During last week we also exported subdetector data (Calorimeter), 99% within the first 24h.
These tests were performed using the newly written "plugin", that will allow us to swiftly react on sites having problems.
This week:
T1-T1 FT: CNAF indicated they are ready, but other T1s could also try (or try again if they had already tried).
Probably this week there will also be data from a subdetector (Muons) to be exported, as was done last week.
<big> CMS report </big>
News on Development:
Logfile archiving: postponed to ProdAgent v0.9. Chained processing: implementation largely in place, still scheduled for the June release. Dealing with large MySQL DBs: some improvement came with the latest release; still working on it.
Data certification, Processing at the T0:
CERN very busy with RelVal production. Validated releases: CMSSW v1.8.4, CMSSW v2.0.0_pre9. High-statistics RelVal samples could not be started at FNAL due to a problem, so CERN had to be used. Tier-0 unavailable due to production, limited to the RelVal queue. The upcoming release is 2.0.0. It will take precedence over 1.1.0_pre1 if necessary; the standard set will run at CERN, and the high-statistics set will run at FNAL in parallel to massive FastSim production.
Re-processing:
still running the never-ending CSA07 signal workflows: all requests finished, waiting for more input datasets; transfers do not seem to work as well. Soups at FNAL: work in progress. The important 1.8.4 FastSim production has started: AlcaReco & physics requests, started at all T1s (also those previously down are now used, e.g. FZK and CNAF). Problems mostly at the config level and due to start-up, not really site issues (yet).
MC production:
40k cosmics data with CMSSW v1.7.7 are now available to physicists in the global DBS. A 10M cosmics request with CMSSW v1.8.4 has started on OSG, plus some more samples. FastSim production: all requests injected in ProdRequest.
Data Transfers and Integrity, DDT-2/LT status:
Low transfer activity (/Prod instance) from CERN to T1 sites (only RAL and FNAL, ~3 TB out of CERN). ~1 TB tape backlog from T1s seen at FNAL. The t1transfer pool at CERN had peaks all within a maximum of 1k files to be migrated to tape.
Running a campaign to review production transfers which did not complete within 30 days of the subscription: it will help to cut the tails wherever useless and to identify problems/bottlenecks in the production transfer system (or in the transfer tool); much work is still needed on top of such lists, though.
DDT status: We have 317 commissioned links (as of April 11th), +23 wrt last week (!). The breakdown is: all 56 T[01]-T1 crosslinks (some to be re-exercised due to being back up & running after downtimes); 162/320 (51%) T1-T2 downlinks and 93/320 (29%) T2-T1 uplinks; 6 T2-T2 links. From the "Site Commissioning" point of view, concerning the link testing, 37/40 T2s have at least 1 commissioned downlink/uplink to the associated T1, and, among these, 30 have at least 2 commissioned T1-T2 downlinks. In total, 93% of the previously commissioned links have already PASSED the new metric as of April 11th (2 months after the start of this DDT-2 phase).
Day-to-day details at https://twiki.cern.ch/twiki/bin/view/CMS/DDTLinkExercising, and (NEW!) more details now visible again online at Nicolo's page: http://magini.web.cern.ch/magini/ddt.html.
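As a sanity check, the link counts in the DDT breakdown above are internally consistent (56 + 162 + 93 + 6 = 317) and reproduce the quoted percentages; the variable names below are only labels for the figures in the report:

```python
# Cross-check of the DDT-2 link-commissioning figures quoted in the report.
crosslinks = 56        # T[01]-T1 crosslinks, all commissioned
t1_t2_downlinks = 162  # commissioned T1-T2 downlinks (out of 320 pairs)
t2_t1_uplinks = 93     # commissioned T2-T1 uplinks (out of 320 pairs)
t2_t2_links = 6
total_pairs = 320

total = crosslinks + t1_t2_downlinks + t2_t1_uplinks + t2_t2_links
print(total)                                       # 317 commissioned links
print(round(100 * t1_t2_downlinks / total_pairs))  # 51 (%)
print(round(100 * t2_t1_uplinks / total_pairs))    # 29 (%)
```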