UKI Monthly Operations Meeting (TB-SUPPORT)

Europe/London
EVO - GridPP Deployment team meeting

Jeremy Coles
Description
- This is the monthly UKI meeting.
- The intention is to run the meeting in EVO: http://evo.caltech.edu/evoGate/. Join the meeting in the GridPP Community area.
- The UK phone bridge is on +44 (0)161 306 6802. The CERN one is +41 22 76 71400. The phone bridge ID is 934961 with code 4880.
- If the CERN phone connection does not work, please try Caltech +1 626 395 2112 or DESY +49 40 8998 1346.
- For more information on the UK phone bridge: http://www.ja.net/services/video/agsc/services/evotelephonebridge.html
    • 10:30 10:50
      Experiment problems/issues & STEP09 20m
      STEP is the new CCRC. Scheduled for (May/)June 09.

      CMS: STEP09 plans: http://tinyurl.com/dd5bhq
      - Need to test analysis at high scale, including stage-out of products to the destination T2 (chaotic stage-out, any T2 permutation allowed)
      -- Requires deployment of efficient stage-out handling of the analysis infrastructure (CRAB server)
      - Backup: introduce more complicated user workflows requiring two steps (analysis with local stage-out, merging with WAN stage-out)
      - Need to stress test user access to conditions via Frontier at T2s
      -- Various user workflows (private MC production, partial re-reconstructions, skimming, ...) might access conditions information
      -- Test how the T2 Frontier systems and the global master Frontier servers cope with analysis load at scale
      - If requested, analysis load can be increased to required scales for multi-VO tests
      Scope: increase scale of analysis at T2s to 150k-200k jobs/d
      - test user stage-out scenarios at scale
      - test Frontier system behavior
      Current CMS status overview: http://tinyurl.com/dchre7

      LHCb: STEP09 plans: http://tinyurl.com/csx9dn
      - Mainly T1 work.

      ATLAS: Expects all supporting sites to be available for STEP09. Plans: http://tinyurl.com/czx6rm
      Data distribution:
      - Test the full ATLAS data placement model including tape (RAW) writing: T0 tape, T0->T1 (disk), T0->T1 (tape), T1->T1 (disk), T1->T2 (disk)
      - Run calibration data distribution 4x T0->T2 (disk)
      - All T2s must participate unless they sign out (before May 15)
      Simulation production:
      - HITS production in T2s and upload to T1s
      - 15,000 jobs/day exclusively in Tier-2s
      - Merged AODs distributed to other clouds (T1->T1s and T1->T2s)
      Reprocessing:
      - Merged ESDs and AODs distributed to other clouds (T1->T1s and T1->T2s)

      Other VO issues:
      -- camont plan to do more processing - please support the activity (see next item)
      -- Please support more EGEE VOs: all sites have been requested to consider enabling a few new VOs (examples: EUMEDGRID, EELA). Yes, we still need to update the GridPP supported VOs page.
      -- Any new issues in this area?
    • 10:50 11:05
      camont 15m
      - Plan to extend activities from image processing to text processing
      - Details of upcoming increased job throughput
      - Questions and issues
      Slides
      slides2
    • 11:05 11:15
      Publishing DNs with APEL 10m
      - We have been asked to start publishing encrypted DNs
      - This allows accounting to be viewed at the user level
      - This item is to discuss what needs to be done
      - Also see the APEL FAQ and this ticket: https://gus.fzk.de/ws/ticket_info.php?ticket=47750
    • 11:15 11:20
      ROC/WLCG stuff 5m
      ROC update
      ***************
      - What is the status for sites moving to SL5 WNs?

      T1 news
      **********
      - TBC from T1 guys

      WLCG update
      ***************
      GDB on 8th April: http://indico.cern.ch/conferenceDisplay.py?confId=45474
      Topics covered:
      - STEP09
      -- Covered above.
      - EGEE authorisation service
      - Identity management - levels of assurance
      -- About developments in the CA world and GDB requirements
      -- Plans for short-lived credentials service
      -- Plans for Member Integrated Credential Service
      - Distributed monitoring in EGEE
      -- Steve Traylen reported on face-to-face meeting of 8th Feb
      -- Lots on Nagios and where to "keep" static information
      -- Included discussion on possible change of SAM portal (-> myOSG)
      - Pakiti campaign
      -- Many sites are not applying security patches (vanilla SL3 distributions!); a wide range of exploits exists in the wild.
      -- OSCT will establish a Pakiti server to collect and evaluate information about the sites' patching status
      -- The middle-term goal is to move the Pakiti framework to Nagios
      - SCAS
      -- Now being deployed at 3-4 sites for production testing
      -- See Christoph Witzig's talk for the workflow involved (http://indico.cern.ch/materialDisplay.py?sessionId=7&materialId=0&confId=45474)
      - Middleware update
      -- WMS on SL4 has shown more problems following update
      -- SCAS/glexec is certified and now on PPS
      -- SL5 WNs available since 25th March. No major issues but little deployed. We know that lcgManageVOTag and lcgtag don't work. Fixes already provided; patches in "Ready for Certification"
      -- SL5 UI being worked on. Some pre-certification issues.
      -- SL5 DPM & LFC first half May.
      - Pilot jobs
      -- LHCb and CMS ok
      -- ATLAS - some desirable features are work in progress
      - SRMv2.2
      - Monitoring in OSG
    • 11:20 11:25
      Site issues? 5m
      - Quick look at monitoring and accounting status
      - Accounting problems noted (see http://www3.egee.cesga.es/gridsite/accounting/CESGA/egee_view.php) for:
      -- UCL-Central (start April)
      -- MAN-HEP (Jan)
      -- ECDF (end March)
      -- RALPP & IC-LeSC (1 week)
      - General issues:
      -- UCL-Central
      -- ECDF
      - Any major Q1 issues for the operations quarterly report?
    • 11:25 11:30
      AOB 5m
      - The next HEPSYSMAN is on 30th June - 1st July