WLCG-OSG-EGEE Operations meeting
Europe/Zurich
28-R-15 (CERN conferencing service; joining details below)
Description
grid-operations-meeting@cern.ch
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure based on weekly reports from
the attendees. The reported issues are discussed, assigned to the relevant teams, followed up and
escalated when needed. The meeting is also the forum for the sites to get a summary of the weekly WLCG activities and plans.
Attendees:
OSG operations team
EGEE operations team
EGEE ROC managers
WLCG coordination representatives
WLCG Tier-1 representatives
other site representatives (optional)
GGUS representatives
VO representatives
To dial in to the conference:
a. Dial +41227676000
b. Enter access code 0157610
16:00 → 16:05    Feedback on last meeting's minutes (5m)
16:05 → 16:30    EGEE Items (25m)
Grid-Operator-on-Duty handover (5m)
From ROC Central Europe (backup: ROC SouthEast Europe) to ROC Germany/Switzerland (backup: ROC SouthWest Europe).
Cases suggested for discussion during the operations meeting:
- INDIACMS-TIER: site in scheduled downtime since 2006-09-15, but the site is working on the problem.
- BEgrid-KULeuven: last escalation step, mail sent, no answer.
- PAKGRID-LCG2: last escalation step, mail sent, no answer.
- Alarms turned off - 269
- Tickets open - 59
- Site OK - 16
- Closed tickets - 14
- 2nd e-mail - 14
- Quarantine - 5
Update on SLC4 migration (5m)
EGEE issues coming from ROC reports (10m)
Reports were not received from these ROCs: NorthernEurope, Russia, SouthEasternEurope, US-OSG.
Reports were not received from these non-HEP VOs: EGrid, BioMed.
UK/I Comments
We have had many complaints about the site tests this week. As a region we are not
happy for our availability to be judged on the basis of the current test results.
Some specific problems behind this complaint are given below with the hope that
(where appropriate) they will also be followed up at the COD meeting this week.
- Tests are running on sites that have marked themselves as in scheduled downtime. Case in point: ce.epcc.ed.ac.uk, 30-10-2006 12:56-23:54, failing sft-job while in maintenance. Is this a monitoring problem or something the site did wrong?
- SAM was supposed to reduce errors arising off-site and affecting many sites. We are still seeing many rm errors being logged as site failures when the error points to CERN ("CASTOR error...."). Example: ce1-gla.scotgrid.ac.uk on 02-11-2006 at 11:17; see also the UCL-HEP entries. What about heuristic testing? If more than 50% of sites fail, then something must be wrong centrally (a sketch of this idea is given after this list).
- A ticket raised as critical has not received a response: "open ticket is critical: https://gus.fzk.de/pages/ticket_details.php?ticket=14737 We are now not matching against FCR tools because our last published test was so long ago." So in this case, because of a central problem with submitting the tests, the site was effectively removed from production. Where is the resilience one expects of a production grid?
- After the maui/torque vulnerability many of our sites closed their queues and as a consequence failed the SAM tests. In the testing this counted against UKI regional availability, which seems inappropriate.
- The SFT pages are still available: https://lcg-sft.cern.ch:9443/sft/lastreport.cgi?action=Configure. Several admins continued to use them when the ops switch happened, only to find later that they were failing the ops VO tests but passing the dteam tests. If the SFTs are no longer being used then, to avoid confusion, it would be best to remove the results page. In addition, the presentation of information under SAM is found to be less helpful than under the SFT pages.
- With tests now running as ops it is more difficult for sites to debug problems. In addition, within UKI we have encountered many problems with the "on-demand" test pages, especially for sites that are uncertified. There should be an end-to-end test of the system by the test service provider.
- The SAM web service is not reliable. This goes against the idea that the service would be available for third parties to develop tests. This is based on the finding that our RB tests, for example, frequently have to retry many times to publish into SAM.
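The heuristic mentioned above (many sites failing the same test points to a central problem rather than to the sites) can be expressed very simply. Below is a minimal, hypothetical Python sketch; the data structure, function name and 50% threshold are illustrative assumptions and are not part of any existing SAM or SFT interface.

# Hypothetical sketch of the "if >50% of sites fail, suspect a central problem"
# heuristic raised in the UK/I comments. Nothing here reflects real SAM code.

def classify_failures(results, central_threshold=0.5):
    """results: dict mapping site name -> True (test passed) / False (failed).

    Returns a dict mapping each failing site to either 'central' (the failure
    is probably an off-site problem and should not count against the site) or
    'site' (treat it as a genuine site failure).
    """
    if not results:
        return {}
    failing = [site for site, passed in results.items() if not passed]
    failure_fraction = len(failing) / len(results)
    verdict = "central" if failure_fraction > central_threshold else "site"
    return {site: verdict for site in failing}

if __name__ == "__main__":
    # Toy example: 3 of 4 sites fail the same test, so the heuristic blames
    # a central service rather than the individual sites.
    sample = {"SITE-A": False, "SITE-B": False, "SITE-C": False, "SITE-D": True}
    print(classify_failures(sample))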
On to other things!
- As a ROC we often get informed of new VOs registered within EGEE. This arrives as an email and we are not able to consider each entry when it arrives. Is there an online repository for this information?
- There is an outstanding action on UKI (2006-08-14—2) to "prove" there is a problem with additional rpms needing to be requested by experiments/VOs in order to get their software functioning at a site. At our collaboration meeting (GridPP17) last week BABAR suggested this was often a problem they encountered outside of the UKI region too (Italy was mentioned). Can other regions please respond if they believe we need to seek meta-rpms from experiments/VOs that ensure all required rpms are installed at sites after the basic installation?
- How do we test a new CE without having to set up a new, separate site?
- SouthWestern: Again this week, some of the CEs have been killed by users submitting jobs through several RBs. For us this is a serious operational issue. For the moment, the only thing we can do to keep our CE up is to block access for the DN of the particular user (a sketch of this workaround follows this list). We believe this is not completely satisfactory for the VOs. Could the VOs confirm whether their productions really need the use of many RBs per user? If this is the case, the reasons for this should be identified and addressed.
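As a rough illustration of the DN-blocking workaround mentioned in the SouthWestern item above, the following Python sketch shows the logic of checking a submitting user's DN against a local ban list. The file name, its format and the function names are assumptions made for illustration; real CEs implement this in their own authorization layer, not with a script like this.

# Illustrative sketch only: check a user's certificate DN against a local ban
# list before accepting work. File name, format and functions are hypothetical.

BAN_FILE = "banned_dns.txt"  # assumed local ban list, one DN per line

def load_banned_dns(path=BAN_FILE):
    """Read the ban list, ignoring blank lines and comment lines."""
    try:
        with open(path) as f:
            return {line.strip() for line in f
                    if line.strip() and not line.startswith("#")}
    except FileNotFoundError:
        return set()

def is_authorized(user_dn, banned=None):
    """Return False if the submitting user's DN is on the local ban list."""
    if banned is None:
        banned = load_banned_dns()
    return user_dn not in banned

if __name__ == "__main__":
    banned = {"/C=XX/O=SomeVO/CN=Overloading User"}  # example banned DN
    print(is_authorized("/C=XX/O=SomeVO/CN=Overloading User", banned))  # False
    print(is_authorized("/C=XX/O=SomeVO/CN=Normal User", banned))       # True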
16:30 → 16:50    OSG Items (20m)
No items for discussion.
16:50 → 17:35    WLCG Items (45m)
WLCG Service Report (15m): https://twiki.cern.ch/twiki/bin/view/LCG/WeeklyServiceReports
WLCG Service Commissioning report and upcoming activities (15m)
See new and updated information at https://twiki.cern.ch/twiki/bin/view/LCG/SC4ExperimentPlans. Speaker: Harry Renshall
WLCG related issues coming from experiment VOs and Tier-1/Tier-2 reports (15m)
Reports were not received from:
Tier-1 sites: NDGF
VOs: CMS, ATLAS, LHCb
ALICE's Report:
- Since Friday at 22:00 until Saturday at 14:00 a very good rate was achieved (337 MB/s).
- From Saturday 14:00 to Sunday 20:00 the rate dropped to 133 MB/s. There is no clear explanation for this; it was raised at the daily 9:00 meeting, but no explanation is available yet.
- Since yesterday at 20:00, rfcp transfers have been failing with the status "Device or resource busy". The CASTOR2 team is looking into the problem. Not only the MDC is affected but the whole ALICE activity on this stager, including the WAN transfers.
- Last night at 00:30 the central AliEn Proxy service crashed. It was restarted this morning at 08:30, so no transfers were submitted overnight.
- Instabilities affecting SRM put were observed on Friday at FZK and CCIN2P3. These issues decreased the efficiency at those sites to 43% and 56% respectively. The instabilities disappeared during Saturday the 4th.
- On the 5th, instabilities were again observed at CCIN2P3, with the error: "The FTS transfer transferid failed (Transfer failed. ERROR an end-of-file was reached)".
17:35 → 17:55    Review of action items (20m)
17:55 → 18:00    AOB (5m)