WLCG-OSG-EGEE Operations meeting
Date/Time: Monday, 16 October 2006 - 16:00 (Europe/Zurich)
Location: CERN conferencing service (joining details below) ( 28-R-15 )
Chairperson:
Description: grid-operations-meeting@cern.ch
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum for the sites to get a summary of the weekly WLCG activities and plans.
Attendees:
  • OSG operations team
  • EGEE operations team
  • EGEE ROC managers
  • WLCG coordination representatives
  • WLCG Tier-1 representatives
  • other site representatives (optional)
  • GGUS representatives
  • VO representatives
  • To dial in to the conference:
    a. Dial +41227676000
    b. Enter access code 0157610


    Material: list of actions, minutes
    Monday, 16 October 2006 16:00 ->17:40 WLCG-OSG-EGEE Operations Meeting (28-R-15)    

     
     16:00
    Feedback on last meeting's minutes (5')    
     16:05
    EGEE Items (25')    
    • Grid-Operator-on-Duty handover (5')
      From ROC Asia-Pacific (backup: ROC Central Europe) to ROC France (backup: ROC Russian)

      Tickets:
      New tickets: 21
      2nd email: 16
      Quarantine: 29
      No SFT results were available for many sites, so we extended the following tickets to 18/10 (IDs 2857, 3152, 3148, 3146, 3144, 3140, 3139, 3110, 2615).

      Notes:

    • Since SFT works now, it is probably best to handle the tickets that were extended. We will not extend these tickets in the future but maintain their status until SFT recovers.
    • Even though Monday and Tuesday were public holidays, we were able to work on CIC tasks on those two days. The ticket statistics for those days are not included in the figures in the previous log.
     
    • Update on SLC4 migration (5')
    Alberto Di Meglio  
    • ALICE use of the SGM account (10')
      THIS ISSUE HAS BEEN REPORTED BY ALICE AND BY THE SITES A FEW TIMES. LET'S TRY TO UNDERSTAND IT AND FIND A SOLUTION TODAY:

      In the ALICE model, job submission is performed through the VOBOXes deployed at each site (T0, T1 and T2). From this point of view, the VOBOX is the experiment's entry point to the Grid and acts as a UI (the UI configuration is included in the VOBOX).
      The VOBOX configuration gives access to sgm users only, so job submission is currently performed using this account.
      ALICE has no particular preference for submitting jobs through the sgm accounts; it does so only because access to the VOBOX is limited to sgm accounts.
      NIKHEF has announced that it is going to limit the number of CPUs available to sgm accounts. This will be a problem for ALICE if all sites decide to do the same.

      Harry Renshall's proposal:
      Is it not the case that, if the grid UI is used to submit production, the VOMS mapping is applied? In that case alicesgm logs in to the VO box and, to submit production, does a voms-proxy-init --voms alice:/Role=production; the certificates sent with the submitted jobs will then map to aliceprod rather than alicesgm. The only requirement is that alicesgm be registered in VOMRS with the production role as well as the lcgadmin role. Presumably you can switch roles when you need to?
    Patricia Mendez, Harry Renshall  
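      The role switch in the proposal above can be sketched as follows. This is only an illustration: the mapping table and both function names are hypothetical, and the assumption that the lcgadmin role maps to alicesgm is inferred from the text rather than taken from any site configuration. The command line is the one quoted in the proposal.

```python
from typing import List, Set

# Assumed VOMS role -> local account mapping (illustrative only).
ROLE_TO_ACCOUNT = {
    "lcgadmin": "alicesgm",
    "production": "aliceprod",
}


def proxy_command(vo: str, role: str) -> List[str]:
    """Build the voms-proxy-init call quoted in the proposal for a given role."""
    return ["voms-proxy-init", "--voms", f"{vo}:/Role={role}"]


def mapped_account(registered_roles: Set[str], requested_role: str) -> str:
    """Return the local account that jobs would map to for the requested role.

    Raises ValueError when the user is not registered with that role, which
    mirrors the requirement that alicesgm hold both roles in VOMRS.
    """
    if requested_role not in registered_roles:
        raise ValueError(f"user is not registered with role {requested_role!r}")
    return ROLE_TO_ACCOUNT[requested_role]
```

      With both roles registered, requesting the production role yields aliceprod, so production jobs would no longer count against any CPU limit placed on sgm accounts.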
    • LFC 1.5.10 (15')
      ATLAS requires that all its related Tier-1s upgrade to LFC 1.5.10 within 2 weeks of it being made available to the production infrastructure.
      1. LFC 1.5.10 will be deployed in PPS on Monday 16th October (today). ATLAS PPS users should be informed when the deployment is finished so that they can test it.

      2. If no problems are found, it will be released to production one week later.
     
    • Tier-1 SRMs not supporting ops VO (5')
      Some Tier-1 SRMs do not support the ops VO and are thus failing the FTS tests.
     
    • top level BDII overloaded (5')
      ops RM failures are now occurring because lcg-bdii.cern.ch is timing out. If the file upload fails, all the other RM tests (download, replication, delete) may fail too.
      This problem affects many other sites using this BDII and there is already an open ticket for this (GGUS#13873).
     
    • EGEE issues coming from ROC reports (10')
      Reports were not received from these ROCs: France
      Reports were not received from these non-HEP VOs: BioMed

      1. DESY reported persistent problems with /tmp directories on the WNs being completely filled by ATLAS jobs. It seems that short ATLAS jobs try to create persistent data in /tmp that is supposed to be used by later jobs. Unfortunately, a 100% full /tmp leads to the immediate abort of subsequent jobs, and the WN starts to act as a black hole. DESY has set up a simple script to purge directories in /tmp that exceed 500 MB. Technical details: typical subdirectory names look like /tmp/execute.131.169.160.210-14618, where the IP address is that of the WN. /tmp is on a separate partition with 1 GB of space. The /home partition, where the jobs are placed, has plenty of space (>50 GB). (DECH)


      2. Comment by PIC: last week quite a lot of SAM errors were seen on the glite-CE with the following error message:
        - host = rb101.cern.ch
        - reason = Got a job held event, reason: "The job attribute PeriodicHold expression 'Matched =!= TRUE && CurrentTime > QDate + 900' evaluated to TRUE"
        The source of this error is not clear to us. Is this seen by other sites with glite-CEs? (SWE)


      3. Last week a case of "bad use of grid resources" was detected at one of the SWE sites: one user was using the WN to download video files. What should be done in such cases? (SWE)
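
      The purge approach DESY describes in item 1 might look roughly like the sketch below. It is not DESY's script, which is not included in the report; the function names, the dry_run switch and the choice to measure only regular files are assumptions made for illustration. The 500 MB threshold and the /tmp location come from the report.

```python
import os
import shutil

LIMIT_BYTES = 500 * 1024 * 1024  # 500 MB threshold quoted in the report


def dir_size(path):
    """Total size in bytes of regular files under path (symlinks are skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            try:
                if not os.path.islink(full):
                    total += os.path.getsize(full)
            except OSError:
                pass  # the file vanished while a job was still running
    return total


def purge_large_dirs(base="/tmp", limit=LIMIT_BYTES, dry_run=True):
    """Return the top-level subdirectories of base larger than limit.

    Unless dry_run is set, each such directory is also deleted.
    """
    purged = []
    for entry in sorted(os.listdir(base)):
        path = os.path.join(base, entry)
        if os.path.isdir(path) and not os.path.islink(path):
            if dir_size(path) > limit:
                if not dry_run:
                    shutil.rmtree(path, ignore_errors=True)
                purged.append(path)
    return purged
```

      Called with the default dry_run=True, the function only reports candidate directories, which is a safer first step on a WN where jobs may still be running; a real deployment would presumably also check that no running job owns the directory.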
     
     16:30
    OSG Items (20')    
    No items for discussion.
     16:35
    WLCG Items (45')    
     
    Harry Renshall  
    • WLCG related Issues coming from experiment VOs and Tier-1/Tier-2 reports (15')
      Reports were not received from:
      > Tier-1 sites: IN2P3; FNAL; BNL
      > VOs: ALICE; ATLAS; CMS

    • The latest issue with the current LHCb VOMS organization (which, as a reminder, is groups-based) concerns LFC. To be able to write to LFC (which does not support secondary groups), /lhcb/lhcbprod users need to be granted write permission on the LFC namespace /grid/lhcb. I opened a Remedy ticket for this last week.
     
     17:15
    Review of action items (20')
     17:35
    AOB (5')