UKI Monthly Operations Meeting (TB-SUPPORT)
Thursday 16 April 2009, 10:30
10:30
Experiment problems/issues & STEP09
10:30 - 10:50
STEP is the new CCRC. Scheduled for (May/)June 09.

CMS: STEP09 plans: http://tinyurl.com/dd5bhq
- Need to test analysis at high scale, including stage-out of products to destination T2 (chaotic stage-out, any T2 permutation allowed)
  • Requires deployment of efficient stage-out handling in the analysis infrastructure (CRAB server)
- Backup: introduce more complicated user workflows requiring two steps (analysis with local stage-out, merging with WAN stage-out)
- Need to stress test user access to conditions via Frontier at T2s
  • Various user workflows (private MC production, partial re-reconstructions, skimming, ...) might access conditions information
  • Test how the T2 Frontier systems and the global master Frontier servers cope with analysis load at scale
- If requested, analysis load can be increased to required scales for multi-VO tests
Scope: increase scale of analysis at T2s to 150k-200k jobs/day
- test user stage-out scenarios at scale
- test Frontier system behavior
Current CMS status overview: http://tinyurl.com/dchre7

LHCb: STEP09 plans: http://tinyurl.com/csx9dn
Mainly T1 work.

ATLAS: Expects all supporting sites to be available for STEP09. Plans: http://tinyurl.com/czx6rm
Data distribution:
- Test the full ATLAS data placement model including tape (RAW) writing: T0 tape, T0->T1 (disk), T0->T1 (tape), T1->T1 (disk), T1->T2 (disk)
- Run calibration data distribution 4x T0->T2 (disk)
- All T2s must participate unless they sign out (before May 15)
Simulation production:
- HITS production in T2s and upload to T1s
- 15,000 jobs/day exclusively in Tier-2s
- Merged AODs distributed to other clouds (T1->T1s and T1->T2s)
Reprocessing:
- Merged ESDs and AODs distributed to other clouds (T1->T1s and T1->T2s)

- Other VO issues
-- camont plan to do more processing - please support the activity (see next item)
-- Please support more EGEE VOs: all sites are requested to consider enabling a few new VOs (examples: EUMEDGRID, EELA). Yes, we still need to update the GridPP supported VOs page.
-- Any new issues in this area?
10:50
camont
10:50 - 11:05
- Plan to extend activities from image processing to text processing
- Details of upcoming increased job throughput
- Questions and issues
11:05
Publishing DNs with APEL
11:05 - 11:15
- We have been asked to start publishing encrypted DNs
- This allows accounting to be viewed at the user level
- This item is to discuss what needs to be done.
- Also see the APEL FAQ and this ticket: https://gus.fzk.de/ws/ticket_info.php?ticket=47750.
11:15
ROC/WLCG stuff
11:15 - 11:20
ROC update
***************
What is the status for sites moving to SL5 WNs?

T1 news
**********
- TBC from T1 guys

WLCG update
***************
GDB on 8th April: http://indico.cern.ch/conferenceDisplay.py?confId=45474
Topics covered:
- STEP09
-- covered above.
- EGEE authorisation service
- Identity management - levels of assurance
-- About developments in the CA world and GDB requirements
-- Plans for short-lived credentials service
-- Plans for Member Integrated Credential Service
- Distributed monitoring in EGEE
-- Steve Traylen reported on face-to-face meeting of 8th Feb
-- lots on Nagios and where to "keep" static information
-- included discussion on possible change of SAM portal (-> myOSG)
- Pakiti campaign
-- Many sites are not applying security patches (vanilla SL3 distributions!); a wide range of exploits exist in the wild.
-- OSCT will establish a Pakiti server to collect and evaluate information about the sites' patching status
-- The medium-term goal is to move the Pakiti framework to Nagios
- SCAS
-- Now being deployed at 3-4 sites for production testing
-- See Christoph Witzig's talk for the workflow involved (http://indico.cern.ch/materialDisplay.py?sessionId=7&materialId=0&confId=45474)
- Middleware update
-- WMS on SL4 has shown more problems following update
-- SCAS/glexec is certified and now on PPS
-- SL5 WNs available since 25th March. No major issues but little deployed. We know that lcgManageVOTag and lcgtag don't work. Fixes already provided; patches in "Ready for Certification"
-- SL5 UI being worked on. Some pre-certification issues.
-- SL5 DPM & LFC first half May.
- Pilot jobs
-- LHCb and CMS ok
-- ATLAS - some desirable features are work in progress
- SRMv2.2
- Monitoring in OSG
11:20
Site issues?
11:20 - 11:25
- Quick look at monitoring and accounting status
- Accounting problems noted (see http://www3.egee.cesga.es/gridsite/accounting/CESGA/egee_view.php) for:
-- UCL-Central (start April)
-- MAN-HEP (Jan)
-- ECDF (end March)
-- RALPP & IC-LeSC (1 week)
- General issues:
-- UCL-Central
-- ECDF
- Any major Q1 issues for the operations quarterly report?
11:25
AOB
11:25 - 11:30
- The next HEPSYSMAN is on 30th June - 1st July