WLCG-OSG-EGEE Operations meeting

Europe/Zurich
28-R-15 (VRVS (Desert room))

Maite Barroso
Description
VRVS "Desert" room will be available 15:30 until 18:00 CET
actionlist
minutes
    • 14:00 17:05

      • 16:00
        Feedback on last meeting's minutes 5m
        Minutes
      • 16:05
        Grid-Operator-on-Duty handover 5m
      • From CERN (backup: Russia) to DECH (backup: Taiwan and UK/Ireland)

      • 9/08/06: escalation - ROC DECH was asked to suspend FZK-PPS because the site is very unstable - to be discussed at the next meeting if there is no reply from the ROC this week
  • 16:10
    SC4 weekly report and upcoming activities 10m
    Speaker: Harry Renshall
    document
  • 16:20
    Next set of upgrades to gLite 3.0 (3.0.2): status update 5m
    Summary of the situation:
    - the SFTs in PPS had some instabilities and only ran on 2-3 days last week
    - Problems with the release reported from PPS:
    * Problem with edg-fetch-crl on TAR_WN glite 3.0.2 (bug 18941), not critical:
    http://savannah.cern.ch/bugs/?func=detailitem&item_id=18941
    * R-GMA registry not updated in PPS (bug 18935), critical:
    http://savannah.cern.ch/bugs/?func=detailitem&item_id=18935
    As the R-GMA registry problem was considered critical, it was decided to keep the release in the pre-production service for one more week. If the problem is not solved by next Monday, the release will go to production anyway, as the registry is a single service run at RAL, and it will not be updated.
  • 16:25
    Issues to discuss from reports 25m

    Reports were not received from ROCs: Italy, Russia, SouthEasternEurope
    Tier-1s:
    VOs:

  • 1. AP: TW-NCUHEP has installed glite-CE. Is it possible for lcg-RB to submit jobs to glite-CE successfully? Does SFT try to submit jobs to glite-CE using lcg-RB?
  • 2. CE: Information System instabilities (large sites). It looks like the fix given at: http://goc.grid.sinica.edu.tw/gocwiki/LCG_Release_Fixes section: "Information System Instabilities" doesn't improve the situation much for some sites. The fix says to move the site GIIS to another machine. That avoids the "site GIIS down" problem; however, the information providers (MDSes listening on port 2135) are still on the overloaded machine with the PBS server. That results in "CPU count erratic" and "missing services" errors in Gstat. We would rather move the PBS server off the CE than move the GIIS. Does this fix work for others?
  • 3. CE: Connected with the above: We reported the problem to LHCb (the VO causing the excessive load) and the VO said the excessive load is caused by many jobs *failing* at the site (GGUS ticket id: 10716). As the only remedy to save the site, they closed the LHCb queue. Is it possible for failing LHCb jobs not to cause such an overload on the CE? We don't notice such a problem with other VOs. (The site can't fix the problem: when they tried to find its origin, the site disappeared from the web page given by the VO: https://webafs3.cern.ch/santinel/cgi-bin/logging_info There were many successfully completed jobs, so digging through local logs to find the failing ones is said to be rather hard.)
  • 4. DECH: One of our sites (SCAI) is planning a DC with biomed in October. The docking application needs the DAG features. When will the RBs of lcg flavour support all gLite-WMS features (esp. DAG), or, failing that, when will glite-WMS be available on all production sites?
  • 5. DECH: High load on CE: FZK proposes an improved jobmanager script. It now calls plain 'qstat' instead of 'qstat -f' in order to reduce the load and network utilization of the CE and of the PBS server. Details: http://goc.grid.sinica.edu.tw/gocwiki/High_%28network%29_load_on_PBS_server_and_CE_caused_by_JobManager
  • 6. NE: PDC asks why the per-node monitoring choices in GOCDB are not honored. Why do they get SFTs directed to a gLite-CE that they know is not fully working yet, and for which they have therefore unchecked monitoring in the GOCDB?
  • 7. NE: SARA experienced problems with the RB. One of the MySQL tables ran into a 4 GB limit, so it was not possible to register any new jobs. We increased MAX_ROWS for the tables. This solved the job registration problem, but the RB still wasn't working correctly. To limit the downtime of the RB, we completely reinstalled it, and it is now operational again. Sites that run an older RB setup (without large row limits) may encounter the same issue when one of the SQL tables reaches 4 GB in size. Apparently this will not be fixed or checked by any updates, since the old SQL tables (and limits) are retained in that process. Detecting the problem is also not trivial: there were no clear error messages, and the problem was located only thanks to a hint from Maarten Litmaath.
  • 8. SWE (from last week): The high load on the CE caused the GRIS (slapd:2135) to respond very slowly. We tried to move the local info provider on the CE from the GRIS on 2135 to a BDII on 2170 that just triggers the GIP, and publishes the information under mds-vo-name=resource,o=grid. We would like to hear from IS experts whether this is an acceptable solution.
  • 9. US: We (ROC_US) are being given tickets for a Canadian site, as evidenced by the following ticket (GOC 2533):
    " I do not have permission to source the athena setup file at this site:
    ./marksReco.sh: line 5: /opt/exp_software/atlas/software/11.0.41/setup.sh: Permission denied
    Cheers,
    Mark
    Other GGUS Ticket Info:
    Submitter Name: mark hodgkinson
    Submitter Email: m.hodgkinson@sheffield.ac.uk
    Submitter Phone: HASH(0x957f91c)
    Experiment: atlas
    Date GGUS Ticket Opened: 2006-08-11 10:33
    Short Description: incorrect file permissions at lcg-ce.lps.umontreal.ca:2119
    Solution: HASH(0x95e7850) "
    What is the correct way to handle this error in ticket routing?
  • VO REPORTS:

  • Alice: T0-T1 transfers STATUS:
    LYON: working fine
    FZK: Does not work. Problem with one port. A ticket was submitted one week ago and there is still no answer (GGUS TICKET:11085)
    CNAF: Does not work. Problem with one disk server. A ticket was submitted one week ago and there is still no answer (GGUS TICKET:11084)
    SARA and RAL: Not yet tested
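
Regarding item 7 (NE, SARA): a minimal sketch of how a site might spot a table approaching its size limit before job registration fails. It assumes the rows come from MySQL's SHOW TABLE STATUS, which reports Data_length and Max_data_length per table. The function name, the 90% threshold, and the sample table names and sizes are hypothetical and are not taken from an actual RB installation.

```python
def tables_near_limit(status_rows, threshold=0.9):
    """Return (table name, fill ratio) for tables close to their size limit.

    status_rows: list of dicts with the keys Name, Data_length and
    Max_data_length, as reported by MySQL's SHOW TABLE STATUS.
    threshold: hypothetical warning level (fraction of the maximum size).
    """
    flagged = []
    for row in status_rows:
        max_len = row.get("Max_data_length") or 0
        if max_len <= 0:
            continue  # no usable limit reported for this table
        ratio = row["Data_length"] / max_len
        if ratio >= threshold:
            flagged.append((row["Name"], round(ratio, 3)))
    return flagged


# Illustrative data only: table names and sizes are made up.
FOUR_GB = 4 * 1024**3
sample = [
    {"Name": "table_a", "Data_length": FOUR_GB - 1024, "Max_data_length": FOUR_GB - 1},
    {"Name": "table_b", "Data_length": 10 * 1024**2, "Max_data_length": FOUR_GB - 1},
]

for name, ratio in tables_near_limit(sample):
    print(f"WARNING: {name} is at {ratio:.1%} of its maximum data length")
```

Run periodically against the RB database, a check of this kind would give a warning in place of the missing error messages described in item 7. How the rows are fetched from the server is left open here.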
  • 16:50
    Review of action items 10m
    more information
  • 17:00
    AOB 5m