28-R-15 (CERN conferencing service (joining details below))
Description
grid-operations-meeting@cern.ch
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure, based on weekly reports from the attendees. Reported issues are discussed, assigned to the relevant teams, followed up, and escalated when needed. The meeting is also the forum for sites to get a summary of the weekly WLCG activities and plans.
Attendees:
OSG operations team
EGEE operations team
EGEE ROC managers
WLCG coordination representatives
WLCG Tier-1 representatives
other site representatives (optional)
GGUS representatives
VO representatives
To dial in to the conference:
a. Dial +41227676000
b. Enter access code 0140768
(ROC CE): Two questions about availability calculation.
a) Could we present what fraction of unavailability periods is considered by sites as non-relevant? Site admins fill in weekly reports and include such information about each individual SAM test failure, so the data is there. In our view this information can help identify areas for improvement in terms of availability.
b) Would it be possible to implement mechanisms for automatic removal of periods in which sites failed due to some monitoring-related problems like this one: https://lcg-sam.cern.ch:8443/sam/sam.py?funct=TestResult&nodename=grid.uibk.ac.at&vo=OPS&testname=CE-host-cert-valid&testtimestamp=1204109361
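The automatic removal asked for in b) can be sketched as an overlap filter: drop any site failure window that coincides with a known monitoring-side outage before computing availability. This is a minimal illustration, not the SAM implementation; the data structures and function name are assumptions.

```python
from datetime import datetime, timedelta

def filter_monitoring_failures(failure_windows, monitoring_outages):
    """Drop failure windows that overlap a known monitoring-side outage,
    so they do not count against the site's availability.
    Windows are (start, end) pairs; names are illustrative only."""
    kept = []
    for start, end in failure_windows:
        overlaps = any(start < o_end and o_start < end
                       for o_start, o_end in monitoring_outages)
        if not overlaps:
            kept.append((start, end))
    return kept

# Example: the first of two failure windows coincides with a monitoring outage.
t = datetime(2008, 2, 27, 12, 0)
failures = [(t, t + timedelta(hours=1)),
            (t + timedelta(hours=3), t + timedelta(hours=4))]
outages = [(t + timedelta(minutes=30), t + timedelta(hours=2))]
print(filter_monitoring_failures(failures, outages))  # keeps only the second window
```

In a real deployment the outage list would come from the monitoring team's own incident records, which is exactly the bookkeeping the question asks for.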
<big> gLite Release News</big>
gLite 3.1 Update 16 was released to production today.
The update contains:
A new index on the attribute GlueServiceEndpoint, used by lcg-utils
UI: bug fixes to the JDL API (bulk submission) and the GFAL clients
dCache SE: GLUE 1.3 clean-ups and bug fixes
DPM SE: version 1.6.7 (32-bit and 64-bit), fixing various configuration bugs, introducing new front-ends for Xroot and HTTP/HTTPS, and upgrading gSOAP from 2.6.2 to 2.7.6b
GFAL version 1.10.8-1: creation of subdirectories with lcg-utils
(France ROC): A lesson learnt from CCRC08 is that some VOs don't check the status published by a CE queue, so they can wrongly submit to a queue with a non-Production status. Indeed, at IN2P3-CC, for the purposes of a combined ATLAS-CMS test, we had set two queues to the status "TEST" in order to restrict access to jobs that had explicitly required this status, but after a while we noticed plenty of regular ("production") jobs on those queues. Please check the queue status before submitting: it must be set to "Production".
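The check requested above amounts to filtering candidate queues on the published GlueCEStateStatus attribute before submission. A minimal sketch, assuming the records are GLUE entries already parsed from the information system into dicts (the client-side parsing itself is not shown; the queue names are placeholders):

```python
def production_queues(queues):
    """Return only the CE queue IDs whose published GlueCEStateStatus is
    'Production'. 'queues' is a list of dicts of parsed GLUE attributes."""
    return [q["GlueCEUniqueID"] for q in queues
            if q.get("GlueCEStateStatus") == "Production"]

# Illustrative entries: one Production queue, one restricted TEST queue.
queues = [
    {"GlueCEUniqueID": "ce.example.org:2119/jobmanager-bqs-long",
     "GlueCEStateStatus": "Production"},
    {"GlueCEUniqueID": "ce.example.org:2119/jobmanager-bqs-test",
     "GlueCEStateStatus": "TEST"},
]
print(production_queues(queues))  # -> ['ce.example.org:2119/jobmanager-bqs-long']
```

A VO that applies such a filter in its submission tools would never have landed regular jobs on the IN2P3-CC "TEST" queues described above.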
<big>WLCG Service Interventions (with dates / times where known) </big>
Request to ATLAS sites to upgrade WNs to SL4
List of sites having CEs that ATLAS can use, by OS:
http://straylen.web.cern.ch/straylen/tmp/atlas-sites-by-os.txt
<big>CMS report</big> (5m)
Data certification, T0 status and reprocessing:
all activities suffered from the LSF incident (full log by CMS at https://twiki.cern.ch/twiki/bin/view/CMS/FacOps-IncidentCERNLSF-Feb28Mar07, discussed with Bernd/Ulrich at the FacOps meeting - see the bottom of http://indico.cern.ch/conferenceDisplay.py?confId=30054). Hard week for RelVal at CERN, too (the LSF issue left CMS behind in release validations). FastSim production was proceeding fast before the problems (6k/15k processing jobs complete) and recovered soon after. --- Good progress on the StorageManager side: identified and configured the nodes to be used in the Global Run in March.
Re-processing:
on CSA07 signal workflows, ~6M GEN-SIM input events have just arrived at the T1s; ~17M processed events last week. Processing is also running at FNAL. FastSim production finalized with CMSSW_1.6.9 (+ 2 additional tags for the config files in CMSSW_1.6.10), ~100M PDAllEvents from the 3 soups (RelVal samples). No site issues at ASGC, CNAF, FZK, PIC, RAL; at FNAL, jobs take too long due to a dCache issue, being investigated; at IN2P3, problems in the pool area left us unable to run merge jobs for several days, now solved, and production is already back on schedule. --- Ran some post-CCRC reprocessing jobs with ATLAS: some lessons learned at IN2P3 and PIC (too long to report here).
MC production:
~85M CSA07 Signal requested events are done and now available for reco. 56 workflows for ~3M requested events are still to be done. Two types of problems (all CMSSW-related, so not worth mentioning here). 4 finished datasets (4M events, 1.45 TB) are subscribed but not yet transferred to any T1 MSS. --- 1 DPG workflow (2 Mevts): GEN-SIM is done; transferring. --- HLT: running (CMSSW_1_7_4, GEN-SIM-DIGI-RAW), 1 big workflow (10 Mevts) in production now, ~2 Mevts are done. --- Detailed summary of current production activities at http://khomich.web.cern.ch/khomich/csa07Signal.html.
Data Transfers and Integrity, DDT-2/LT status:
/Prod transfers: proceeding, 16 TB/week this week, no major problems. /Debug transfers: new links have been commissioned exclusively with the new DDT-2 metric since February 11th. Link exercising is proceeding, generally very successfully: 78% of the previously commissioned links have already PASSED the new metric as of March 6th. We have 285 commissioned links (as of March 6th). The breakdown is: 55/56 T[01]-T1 cross-links (only ASGC->RAL is missing); 142 T1-T2 downlinks and 83 T2-T1 uplinks; 38 T2s have at least 1 downlink and 37 T2s have at least 1 uplink, and the intersection is 35 T2s that have both; 5 T2-T2 links. First round of testing is almost complete. Sites can take advantage of the gap before the second round to commission new links or recommission failed links. Real problems were found and fixed during exercising; the first "success stories" in troubleshooting are being documented. --- Full details at https://twiki.cern.ch/twiki/bin/view/CMS/DDTLinkExercising.
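The T2 bookkeeping in the breakdown above (sites with at least one downlink, sites with at least one uplink, and the intersection of both sets) is a simple set computation over the commissioned-link list. A minimal sketch with placeholder site names, not the actual DDT-2 tooling:

```python
def summarize(downlinks, uplinks):
    """Count commissioned links and the T2 sites appearing on each side.
    downlinks: (T1, T2) pairs; uplinks: (T2, T1) pairs; names are placeholders."""
    t2_down = {t2 for (_t1, t2) in downlinks}   # T2s with >=1 downlink
    t2_up = {t2 for (t2, _t1) in uplinks}       # T2s with >=1 uplink
    return len(downlinks), len(uplinks), len(t2_down & t2_up)

# Toy example: T2_Y is the only site with both a downlink and an uplink.
downlinks = [("T1_A", "T2_X"), ("T1_A", "T2_Y"), ("T1_B", "T2_Y")]
uplinks = [("T2_Y", "T1_A"), ("T2_Z", "T1_B")]
print(summarize(downlinks, uplinks))  # (3, 2, 1)
```

Run over the real commissioned-link list, the same computation yields the figures quoted in the report (142 downlinks, 83 uplinks, 35 T2s with both).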