WLCG-OSG-EGEE Operations meeting

Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum for the sites to get a summary of the weekly WLCG activities and plans.
  • OSG operations team
  • EGEE operations team
  • EGEE ROC managers
  • WLCG coordination representatives
  • WLCG Tier-1 representatives
  • other site representatives (optional)
  • GGUS representatives
  • VO representatives
    NB: Reports were not received in advance of the meeting from:

  • ROCs: UKI
  • VOs: Alice, Atlas, LHCb
  • Minutes
      • 16:00 16:00
        Feedback on last meeting's minutes
      • 16:01 16:30
        EGEE Items 29m
        • Grid-Operator-on-Duty handover
          From: Italy / SWE
          To: Russia / DECH

          1. Note that on Thursday, 6 March, the COD portal was unavailable.
          Backup Team (SouthWestern Europe):
          - 1st mail: 18
          - 2nd mail: 18
          - Site OK: 31
          - Quarantine: 5
          Total: 72
        • PPS Report & Issues
          PPS reports were not received from these ROCs:
          AP, IT, SEE, SWE

          Issues from EGEE ROCs:
          1. None reported

          Release News:
          1. gLite 3.1.0 PPS Update 20 has finished the pre-deployment phase and is now available in the public PPS repository.
            In particular, this update contains:
            • glite-MON for gLite 3.1 / SL4
            Release notes in:
            Pre-deployment test reports in:
        • EGEE issues coming from ROC reports
          1. (ROC CE): Two questions about availability calculation.
            a) Could we present what fraction of unavailability periods sites consider non-relevant? Site admins fill in weekly reports and record this information for each individual SAM test failure, so the data is there. In our view this information would help identify areas to improve in terms of availability.
            b) Would it be possible to implement mechanisms for automatic removal of periods in which sites failed due to some monitoring-related problems like this one: https://lcg-sam.cern.ch:8443/sam/sam.py?funct=TestResult&nodename=grid.uibk.ac.at&vo=OPS&testname=CE-host-cert-valid&testtimestamp=1204109361
        • gLite Release News
          1. gLite 3.1 Update 16 was released to production today.
            The update contains:
            • A new index on the attribute GlueServiceEndpoint, used by lcg-utils
            • UI: Bug fixes to jdl API (bulk submission) and gfal clients
            • dcache SE: Glue 1.3 cleanups and bug fixes
            • DPM SE: version 1.6.7 (32-bit and 64-bit) fixing various configuration bugs; introducing new front-ends for Xroot and HTTP/HTTPS; upgrading the version of gSOAP from 2.6.2 to 2.7.6b
            • GFAL version 1.10.8-1: creation of subdirectories with lcg-utils
            • lcgCE: bug fixes
            Release notes:
      • 16:30 17:00
        WLCG Items 30m
        • WLCG issues coming from ROC reports
          1. (France ROC): A lesson learnt from CCRC08 is that some VOs do not check the status published by a CE queue, so they can wrongly submit to a queue with a non-Production status. At IN2P3-CC, for the purpose of an Atlas-CMS combined test, we had set 2 queues to status "TEST" in order to restrict access to jobs that had explicitly required this status, but after a while we noticed plenty of regular (i.e. "production") jobs on those queues. Please check the queue status before submitting; it must be set to "Production".
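
The check requested above can be illustrated with a minimal sketch. It assumes queue information has already been fetched from the information system into plain records; the record layout, the `status` key, the queue names and the helper `production_queues` are all hypothetical and not part of any grid middleware API.

```python
def production_queues(queues):
    """Return only the queues whose published status is exactly 'Production'."""
    return [q for q in queues if q.get("status") == "Production"]


# Hypothetical queue records; the names and the 'status' key are illustrative only.
queues = [
    {"name": "long", "status": "Production"},
    {"name": "ccrc-test", "status": "TEST"},
    {"name": "short", "status": "Production"},
]

# Only queues published as "Production" remain candidates for submission.
selected = production_queues(queues)
print([q["name"] for q in selected])
```

Matching on the exact string means that any other published status, including a site-specific one such as "TEST", is excluded by default.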
        • WLCG Service Interventions (with dates / times where known)
        • CCRC'08 Operational Review
          • Item 1
          Speaker: Harry Renshall / Jamie Shiers
        • Alice report 5m
        • Atlas report 5m
          1. Request to Atlas sites to upgrade WNs to SL4
            List of sites having CEs that Atlas can use, by OS: http://straylen.web.cern.ch/straylen/tmp/atlas-sites-by-os.txt
          more information
        • CMS report 5m

          • Data certification, T0 status and reprocessing:
            all activities suffered from the LSF incident (full log by CMS at https://twiki.cern.ch/twiki/bin/view/CMS/FacOps-IncidentCERNLSF-Feb28Mar07, discussed with Bernd/Ulrich at the FacOps meeting - see bottom of http://indico.cern.ch/conferenceDisplay.py?confId=30054). Hard week for RelVal at CERN as well (the LSF issue left CMS behind in release validations). FastSim production was proceeding fast before the problems (6k/15k proc jobs complete), and recovered soon after. --- Good progress on the StorageManager side: identified and configured the nodes to be used in the Global Run in March.
          • Re-processing:
            on CSA07 signal workflows, ~6M GEN-SIM input evts have just arrived at T1s; ~17M evts processed last week. Processing running at FNAL also. FastSim production finalized with CMSSW_1.6.9 (+ 2 additional tags for the config files CMSSW_1.6.10), ~100M PDAllEvents from the 3 soups (RelVal samples). No site issues at ASGC, CNAF, FZK, PIC, RAL; at FNAL, jobs take too long due to a dCache issue, being investigated; at IN2P3, problems in the pool area, several days without being able to merge jobs, now solved and production is already back on schedule. --- Ran some post-CCRC reprocessing jobs with ATLAS: some lessons learned at IN2P3 and PIC (too long to report here).
          • MC production:
            ~85M CSA07 Signal requested events were done, now available for reco. 56 workflows for ~3M requested events still to be done. Two types of problems (all CMSSW-related, so not worth mentioning here). 4 finished datasets (4M events, 1.45TB) are subscribed but not yet transferred to any T1 MSS. --- 1 DPG workflow (2 Mevts): GEN-SIM is done. Transferring. --- HLT: running (it's CMSSW_1_7_4, GEN-SIM-DIGI-RAW), 1 big workflow (10 Mevts) in production now, ~2 Mevts are done. --- Detailed summary of current production activities at http://khomich.web.cern.ch/khomich/csa07Signal.html.
          • Data Transfers and Integrity, DDT-2/LT status:
            /Prod transfers: proceeding, 16 TB/week this week, no major problems. /Debug transfers: new links have been commissioned exclusively with the new DDT-2 metric since February 11th. Link exercising is proceeding, generally very successfully: 78% of the previously commissioned links have already PASSED the new metric as of March 6th. We have 285 commissioned links (as of March 6th). The breakdown is: 55/56 T[01]-T1 crosslinks (only ASGC->RAL is missing); 142 T1-T2 downlinks and 83 T2-T1 uplinks, 38 T2s have at least 1 downlink and 37 T2s have at least 1 uplink, the intersection is 35 T2s that have both; 5 T2-T2 links. First round of testing almost complete. Sites can take advantage of the gap before the second round to commission new links or recommission failed links. Real problems found and fixed during exercising; first "success stories" in troubleshooting being documented. --- Full details at https://twiki.cern.ch/twiki/bin/view/CMS/DDTLinkExercising.
          Speaker: Daniele Bonacorsi
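
As a cross-check of the link breakdown quoted in the CMS report, the short sketch below re-adds the per-category counts and derives how many T2 sites have at least one commissioned link in either direction. The numbers are taken from the report; the variable names are illustrative, and the derived T2 count is an inference by inclusion-exclusion rather than a figure given in the report.

```python
# Commissioned-link counts quoted in the report (as of March 6th).
links = {
    "T[01]-T1 crosslinks": 55,
    "T1-T2 downlinks": 142,
    "T2-T1 uplinks": 83,
    "T2-T2 links": 5,
}
total = sum(links.values())

# T2 sites with at least one downlink, at least one uplink, and both.
t2_down, t2_up, t2_both = 38, 37, 35

# Inclusion-exclusion: T2 sites with a downlink or an uplink (or both).
t2_any = t2_down + t2_up - t2_both

print(total, t2_any)  # prints "285 40"
```

The total matches the 285 commissioned links stated in the report.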
        • LHCb report 5m
      • 17:00 17:30
        OSG Items 30m
        Speaker: Rob Quick (OSG - Indiana University)
        • Discussion of open tickets for OSG
      • 17:30 17:35
        Review of action items 5m
        list of actions
      • 17:35 17:35
