WLCG-OSG-EGEE Operations meeting
Date/Time: Monday, 15 January 2007 - 16:00 (Europe/Zurich)
Location: CERN conferencing service (joining details below) ( 28-R-15 )
Chairperson:
Description: grid-operations-meeting@cern.ch
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure, based on weekly reports from the attendees. Reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum for sites to get a summary of the weekly WLCG activities and plans.
Attendees:
  • OSG operations team
  • EGEE operations team
  • EGEE ROC managers
  • WLCG coordination representatives
  • WLCG Tier-1 representatives
  • other site representatives (optional)
  • GGUS representatives
  • VO representatives
To dial in to the conference:
    a. Dial +41227676000
    b. Enter access code 0157610


    Material: list of actions link minutes link

     
     Monday, 15 January 2007
     16:00
    Feedback on last meeting's minutes (5')   minutes link    
     16:05
    EGEE Items (25')    
    • Grid-Operator-on-Duty handover (5')
      From ROC CERN (backup: ROC France) to ROC Russia (backup: ROC Taiwan)

      Tickets:
      New tickets :
      2nd email:
      Quarantine:

      Notes: Nothing requires handing over this week.
     
    • PPS reports (5')
      PPS reports were not received from these ROCs: AP, France, Italy
    Nicholas Thackray (CERN)  
    • Major Operational Issues Encountered During the Reporting Period (5')
      1. Two new sites certified: KR-KISTI-GCRT-01 & HK-HKU-CC-01 (AsiaPacific)


      2. goc.grid.sinica.edu.tw will be in maintenance on Jan 16, 3:00-12:00 UTC (AsiaPacific)


      3. Within the Regional Certification task we have prepared a first version of the "WMS experimental performance results". For more information see:
        http://wiki.grid.cyfronet.pl/RegionalCertification/WMSLB-3.0.2u10-epr
        Comments are welcome. One of the conclusions is that a single WMS machine can take ~1 080 000 very simple jobs in 24 h, submitted in the typical (non-bulk) way from several clients in parallel. (CentralEurope)


      4. Certification of two sites is still active, SDU-LCG2 in China and Indiana in the USA. In the last week both have made progress towards certification (CERN)
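
      For scale, the 24-hour WMS figure quoted in item 3 can be converted to per-hour, per-minute and per-second rates. This is plain arithmetic on the quoted number only; it assumes a uniform submission rate over the day, which the report does not state.

```python
# Convert the quoted WMS throughput (item 3) into average rates.
# Assumption: submissions are spread uniformly over the 24 hours.
JOBS_PER_DAY = 1_080_000
HOURS_PER_DAY = 24
SECONDS_PER_DAY = HOURS_PER_DAY * 60 * 60

jobs_per_hour = JOBS_PER_DAY / HOURS_PER_DAY
jobs_per_minute = JOBS_PER_DAY / (HOURS_PER_DAY * 60)
jobs_per_second = JOBS_PER_DAY / SECONDS_PER_DAY

print(f"{jobs_per_hour:.0f} jobs/hour")      # 45000 jobs/hour
print(f"{jobs_per_minute:.0f} jobs/minute")  # 750 jobs/minute
print(f"{jobs_per_second:.1f} jobs/second")  # 12.5 jobs/second
```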


     
    • EGEE issues coming from ROC reports (10')
      Reports were not received from these ROCs:
      Reports were not received from these non-HEP VOs:

      1. On the gLite CE, process:
         8208 dteam046 25 0 87348 85M 1268 R 26.8 17.0 0:19 0 grid-proxy-init
         It goes up to 140MB! What can this be doing? (Triumf, CERN)


      2. Renewing host certs is a minefield due to non-root-owned copies. Typically they are put there by the yaim install. If they were put there by init.d start scripts, or some other function, then I wouldn't have to search. So far FTS, LFC, RGMA, gLite CE LB have all wasted my time. (Triumf, CERN)


      3. GGUS Ticket #16771 has not progressed since 18 December 2006 (DECH)


      4. The history of SAM test results is not long enough. IMHO it should cover at least the last reporting period. (DECH)


      5. There is still a discrepancy between SAM and the CIC report. Are "a01-004-128.gridka.de" and "ce-fzk.gridka.de" reversed? See the GridKa report: "Guys I dont get it. We have a none failure history in the sam pages but this report thinks otherwise. I have no interest in inventing repairs to problems that did not exist. Explain please! ..." (DECH)


      6. The IFAE site is still having problems with APEL. They have been in contact with Dave Kant, but are still waiting for an e-mail answer from last week. They feel that closer contact would help the debugging process. (Ifae, SWE)
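
      Regarding item 2 (stray copies of host certificates), one way to avoid searching by hand is to look for files that are byte-identical to the current certificate before replacing it. The following is a minimal sketch, not a tool shipped with yaim or gLite: the function names are hypothetical, and the reference path and search roots are assumptions that will differ per service.

```python
import hashlib
import os
import stat


def sha256_of(path):
    """Return the SHA-256 hex digest of a file's contents."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_cert_copies(reference, roots):
    """List (path, owner uid) of regular files identical to `reference`.

    Run this BEFORE renewing, so the copies still match the old certificate.
    """
    ref_real = os.path.realpath(reference)
    ref_size = os.path.getsize(reference)
    ref_digest = sha256_of(reference)
    copies = []
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    info = os.lstat(path)
                    # Skip symlinks, FIFOs, sockets and files of another size.
                    if not stat.S_ISREG(info.st_mode) or info.st_size != ref_size:
                        continue
                    if os.path.realpath(path) == ref_real:
                        continue  # the reference file itself
                    if sha256_of(path) == ref_digest:
                        copies.append((path, info.st_uid))
                except OSError:
                    continue  # unreadable or vanished file
    return sorted(copies)


if __name__ == "__main__":
    # Assumed locations; adjust to the services installed on the node.
    reference = "/etc/grid-security/hostcert.pem"
    if os.path.exists(reference):
        for path, uid in find_cert_copies(reference, ["/etc", "/opt", "/home", "/var"]):
            print(f"{path} (owner uid {uid})")
```

      The same function can be pointed at the host key instead of the certificate. Copies with an owner uid other than 0 are the ones that need replacing by hand after renewal.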


     
     16:30
    OSG Items (20')    
    The Joint InterOperations Meeting is being moved to Indianapolis rather than Bloomington.
    Also, Monday is a holiday here, so we may not attend the call.
     16:50
    WLCG Items (45')    
     
    Harry Renshall  
    • WLCG related Issues coming from experiment VOs and Tier-1/Tier-2 reports (15') Tier-1 reports pdf file  
      Reports were not received from:
      > Tier-1 sites:

      > VOs: NO VO REPORT RECEIVED

    • ATLAS: Cns_srv_mkdir: returns 13 Error
      DPM errors:
      01/12 16:13:47 2697,0 Cns_srv_mkdir: NS092 - mkdir request by /C=CA/O=Grid/OU=westgrid.ca/CN=Rodney Walker (113,249) from grid006.mi.infn.it
      01/12 16:13:47 2697,0 Cns_srv_mkdir: NS098 - mkdir /dpm/mi.infn.it/home/atlas/dq2/caldigoff1_mc12/log 775 0
      01/12 16:13:47 2697,0 Cns_srv_mkdir: returns 13
      To solve these errors, you need to set ACLs for all ATLAS VOMS groups and roles under /dpm/.../home/atlas.
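
      The three log lines above can be picked apart mechanically when many such failures need to be collected. The sketch below parses the lines exactly as quoted and maps the return code to an errno name. It assumes the name server return code follows standard POSIX errno numbering, under which 13 is EACCES (permission denied); that reading is consistent with the ACL fix but is not stated in the report. The function name and regular expressions are illustrative only.

```python
import errno
import re

# Log lines copied verbatim from the report above.
LOG = """\
01/12 16:13:47 2697,0 Cns_srv_mkdir: NS092 - mkdir request by /C=CA/O=Grid/OU=westgrid.ca/CN=Rodney Walker (113,249) from grid006.mi.infn.it
01/12 16:13:47 2697,0 Cns_srv_mkdir: NS098 - mkdir /dpm/mi.infn.it/home/atlas/dq2/caldigoff1_mc12/log 775 0
01/12 16:13:47 2697,0 Cns_srv_mkdir: returns 13
"""

# "<date> <time> <pid>,<thread> <function>: <message>"
LINE_RE = re.compile(r"^(\S+ \S+) (\d+),(\d+) (\w+): (.*)$")
RETURNS_RE = re.compile(r"^returns (\d+)$")


def failed_calls(log_text):
    """Yield (timestamp, function, code, errno name) for non-zero returns."""
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue
        timestamp, _pid, _thread, function, message = match.groups()
        returned = RETURNS_RE.match(message)
        if returned and int(returned.group(1)) != 0:
            code = int(returned.group(1))
            # Assumption: the code is a standard errno value.
            yield timestamp, function, code, errno.errorcode.get(code, "UNKNOWN")


for timestamp, function, code, name in failed_calls(LOG):
    print(f"{timestamp} {function} failed with {code} ({name})")
```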

    • LHCb has been fire-fighting for a week because of the very bad state of the T1 SEs, which prevents reliable transfers (whether through FTS or through low-level utilities that allow third-party transfer).
      Many transfers are failing from almost everywhere to almost everywhere, as is also evident from the LHCb T1-T1 connection matrix suite:
      http://santinel.home.cern.ch/santinel/cgi-bin/lhcb
      The poor health of the T1 SEs is causing huge backlogs to build up (through the fail-over mechanism) at various VO-boxes, which attempt to transfer asynchronously the output of MC simulation jobs that could not be stored on the destination SE. The generally bad state of the SEs is also preventing data from being redistributed for reconstruction at the T1 centres, slowing down all reconstruction activities.
      LHCb would like to ask this ops meeting:
      Are the other VOs experiencing problems in moving data around?
     
     17:35
    Review of action items (20')   list of actions link    
     17:55
    AOB (5')