28-R-15 (CERN conferencing service (joining details below))
Description
grid-operations-meeting@cern.ch Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure, based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum for the sites to get a summary of the weekly WLCG activities and plans.
Attendees:
OSG operations team
EGEE operations team
EGEE ROC managers
WLCG coordination representatives
WLCG Tier-1 representatives
other site representatives (optional)
GGUS representatives
VO representatives
To dial in to the conference:
a. Dial +41227676000
b. Enter access code 0140768
NB: Reports were not received in advance of the meeting from:
ROCs:
VOs: Alice, BioMed, CMS, LHCb
16:00
→
16:00
Feedback on last meeting's minutes
Minutes
16:01
→
16:30
EGEE Items (29m)
<big> Grid-Operator-on-Duty handover </big>
From: SWE / Italy
To: DECH / Russia
Report from Italy COD:
Site: ru-Moscow-GCRAS-LCG2,
GGUS #34045, #34051, #34817
Reached last escalation step, but then the site reacted with:
"Still problem with certificates, including users certs and RA."
The RA itself has certificate problems and is preparing the paperwork for renewal.
We allowed them to wait for this in a downtime state, because it is not a software problem to be corrected, but simply a wait for new certificates to be provided by the CA/RA.
Report from SWE COD:
Australia-UNIMELB-LCG2:
GGUS Ticket #34393
The site comments that their SE is full because the ATLAS VO is not removing files. Is this a problem for the ATLAS VO, or should the site reserve disk space for the OPS VO?
YerPHI:
GGUS Ticket #26634
The site has been escalated to the political instance, but is neither in Scheduled Downtime nor suspended.
What is the latest status on this?
<big> PPS Report & Issues </big>
PPS reports were not received from these ROCs:
AP, FR, IT, NE
Issues from EGEE ROCs:
CERN ROC: yaim-core 4.0.4, released with gLite 3.1.0 PPS Update 22, introduces a check that blocks the configuration if read permissions are given to non-root users on the site-info file or the directory where it is stored. This causes problems in set-ups where the permissions cannot be changed to 700 (e.g. installations of the UI on AFS). A bug has been opened for this (https://savannah.cern.ch/bugs/?35307), and the check will be softened in version 4.0.5. Sites installing version 4.0.4 should be prepared to change a function in yaim as described in https://twiki.cern.ch/twiki/bin/view/LCG/YaimGuide400#Known_issues
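yaim itself is implemented in bash; the following Python sketch only illustrates the logic of the new check (refuse to configure when the site-info file or its directory carries any group/other permission bits). The paths and function names here are made up for illustration and are not yaim's actual code:

```python
# Illustrative sketch of the yaim-core 4.0.4 check (not yaim's real code):
# configuration is refused if the site-info file or its directory is
# accessible to non-root users, i.e. has any group/other permission bits.
import os
import stat
import tempfile

def open_to_non_root(path):
    """True if group or other users have any permission bits on path."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return bool(mode & 0o077)

# Simulate a correctly locked-down set-up (directory 700, file 600).
site_dir = tempfile.mkdtemp()
site_info = os.path.join(site_dir, "site-info.def")
open(site_info, "w").close()
os.chmod(site_dir, 0o700)
os.chmod(site_info, 0o600)

blocked = open_to_non_root(site_dir) or open_to_non_root(site_info)
print("yaim would refuse configuration" if blocked else "permissions OK")
```

On an AFS-hosted UI, where the directory mode cannot be reduced to 700, such a check necessarily fires, which is exactly the problem the ROC reports.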
<big> gLite Release News</big>
Release News:
Now in production
gLite 3.1 Update 18 went to production last Monday.
The update contains:
NEW: glite-MON for SL4
DPM 1.6.7-4
fix for bug #33769: incorrect pool free space after dpm-drain
improved ACL management for srmMkdir command
UI/WN/VOBOX
lcg-tags no longer produces Globus warnings
voms-admin client 2.0.6-1 providing ACL support on command line
vdt_globus_essentials (affecting several services and notably the CE)
bug fix to prevent globus-job-manager processes from piling up on a CE
(bug observed at CERN after SAM WMS/RB tests were enabled)
voms-admin server (VOMS)
Refactored voms-admin-ping script
ACL management web service (compatible with client >= 2.0.6-1)
gLite 3.1.0 PPS Update 22 passed the pre-deployment tests
and it is now being installed by the PPS sites.
The release contains, among others, an update of yaim-core, so,
technically, all services are concerned.
The full list of patches deployed is:
glite-AMGA_oracle (initial release)
UI/WN/VOBOX
GFAL/lcg_util: many bug fixes
new lcg-ManageVOTag version (solving bug #34245)
lcg-infosites: new option to query the wms and lb associated to a VO.
-f option to filter based on the site name
[ YAIM ] glite-yaim-clients: bug fixes + configurable list of WMS and LB
R-GMA
Switch back to using MEMORY instead of DATABASE producer
YAIM (affecting all nodes)
new yaim-core with a consistent list of changes and bug fixes
CE
change to lcg-info-dynamic-scheduler to support DENY tags
2008-04-11(1): Task: gLite 3.1 Update 19 --> Production in preparation
The update will contain:
UI/WN/VOBOX
many bug fixes, including the one preventing the use of aliases for WMS
new lcg-ManageVOTag version
MON
R-GMA fix for forwards compatibility - 3.1.0 PPS Update 22
Many services
lcg-vomscerts-4.9.0 adds next cert for lcg-voms
<big> EGEE issues coming from ROC reports </big>
(ROC CE): Majority of CE sites failed SAM due to wrongly advertised LFC for OPS VO. https://gus.fzk.de/pages/ticket_details.php?ticket=35093
It is a weak point of the infrastructure that a site can publish anything and make all sites fail OPS tests. Are there any plans to change it?
(ROC France): OPS test was using lfc-lhcb.grid.sara.nl as LFC server for OPS.
This shows the information service cannot be trusted; it is a point of failure that allows anyone to deny service to others.
Please, would it be possible to consider a grid where nobody could break the grid simply by publishing something wrong?
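One defensive approach to the concern raised above would be for the OPS tests to cross-check any LFC endpoint found in the information system against a VO-maintained allowlist before trusting it. This is only an illustrative sketch, not an existing SAM feature; the allowlisted hostname and helper function are hypothetical (only the SARA endpoint comes from the report above):

```python
# Illustrative sketch only (not an existing SAM/OPS feature): filter LFC
# endpoints published in the information system through a VO-maintained
# allowlist, so a single mis-publishing site cannot fail everyone's OPS tests.
# "lfc-ops.example.cern.ch" is a hypothetical OPS endpoint; the SARA host is
# the one wrongly picked up in the incident reported by ROC France.

OPS_LFC_ALLOWLIST = {"lfc-ops.example.cern.ch"}

def trusted_lfc_endpoints(published):
    """Keep only the published endpoints that the OPS VO actually recognises."""
    return [host for host in published if host in OPS_LFC_ALLOWLIST]

published = ["lfc-ops.example.cern.ch", "lfc-lhcb.grid.sara.nl"]
print(trusted_lfc_endpoints(published))  # the wrongly published host is dropped
```

The design point is simply that the information system is treated as a hint rather than an authority: anything it publishes is validated against data the VO controls before it can influence test results.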
16:30
→
17:00
WLCG Items (30m)
<big> WLCG issues coming from ROC reports </big>
None this week.
<big>WLCG Service Interventions (with dates / times where known) </big>
[INFO] FZK Downtime: Due to the LFC DB migration from MySQL to Oracle, GridKa/FZK's LFC service will be down on Friday 18/04/2008 from 5:30 UTC to 20:00 UTC (the LHCb LFC will not be affected by this).
DB downtime at CERN-PROD taking down FTS, SAM, GridView, VOMS and LFC, Thursday April 17th 2008. All the details
Last week functional test was quite good.
During last week we also exported subdetector data (Calorimeter), 99% within the first 24h.
These tests were performed using the newly written "plugin", that will allow us to swiftly react on sites having problems.
This week:
T1-T1 FT: CNAF indicated they are ready, but other T1s could also try (or try again if they had already tried).
Probably this week there will also be data from a subdetector (Muons) to be exported, as was done last week.
<big> CMS report </big>
News on Development:
Logfile archiving: postponed to ProdAgent v0.9. Chained processing: implementation largely in place, still scheduled for the June release. Dealing with large MySQL DBs: some improvement came with the latest release; still working on it.
Data certification, Processing at the T0:
CERN very busy with RelVal production. Validated releases: CMSSW v1.8.4, CMSSW v2.0.0_pre9. High-statistics RelVal samples could not be started at FNAL due to a problem, so CERN had to be used. Tier-0 unavailable due to production, limited to the RelVal queue. The upcoming release is 2.0.0. It will take precedence over 1.1.0_pre1 if necessary; the standard set will run at CERN, and the high-statistics set will run at FNAL in parallel to massive FastSim production.
Re-processing:
still running the never-ending CSA07 signal workflows: all requests finished, waiting for more input datasets; transfers do not seem to work as well. Soups at FNAL: work in progress. The important 1.8.4 FastSim production has started: AlcaReco & physics requests, started at all T1s (also those previously down are now used, e.g. FZK and CNAF). Problems mostly at the config level and due to start-up, not really site issues (yet).
MC production:
40k cosmics data with CMSSW v1.7.7 are now available to physicists in the global DBS. A 10M cosmics request with CMSSW v1.8.4 has started on OSG, plus some more samples. FastSim production: all requests injected in ProdRequest.
Data Transfers and Integrity, DDT-2/LT status:
Low transfer activity (/Prod instance) from CERN to T1 sites (only RAL and FNAL, ~3 TB out of CERN). ~1 TB tape backlog from T1s seen at FNAL. The t1transfer pool at CERN had peaks all within a maximum of 1k files to be migrated to tape.
Running a campaign to review production transfers which did not complete within 30 days of the subscription: it will help to cut the tails wherever useless and to identify problems/bottlenecks in the production transfer system (or in the transfer tool); much work is still needed on top of such lists, though.
DDT status: We have 317 commissioned links (as of April 11th), +23 wrt last week (!). The breakdown is: all 56 T[01]-T1 crosslinks (some to be re-exercised due to being back up & running after downtimes); 162/320 (51%) T1-T2 downlinks and 93/320 (29%) T2-T1 uplinks; 6 T2-T2 links. From the "Site Commissioning" point of view, concerning the link testing, 37/40 T2s have at least 1 commissioned downlink/uplink to the associated T1, and, among these, 30 have at least 2 commissioned T1-T2 downlinks. In total, 93% of the previously commissioned links have already PASSED the new metric as of April 11th (2 months after the start of this DDT-2 phase).
Day-to-day details at https://twiki.cern.ch/twiki/bin/view/CMS/DDTLinkExercising, and (NEW!) more details now visible again online at Nicolo's page: http://magini.web.cern.ch/magini/ddt.html.
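As a sanity check, the link counts in the DDT breakdown above are internally consistent (56 + 162 + 93 + 6 = 317) and reproduce the quoted percentages; the variable names below are only labels for the figures in the report:

```python
# Cross-check of the DDT-2 link-commissioning figures quoted in the report.
crosslinks = 56        # T[01]-T1 crosslinks, all commissioned
t1_t2_downlinks = 162  # commissioned T1-T2 downlinks (out of 320 pairs)
t2_t1_uplinks = 93     # commissioned T2-T1 uplinks (out of 320 pairs)
t2_t2_links = 6
total_pairs = 320

total = crosslinks + t1_t2_downlinks + t2_t1_uplinks + t2_t2_links
print(total)                                       # 317 commissioned links
print(round(100 * t1_t2_downlinks / total_pairs))  # 51 (%)
print(round(100 * t2_t1_uplinks / total_pairs))    # 29 (%)
```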