Deployment team

Europe/Zurich
EVO - GridPP Deployment team meeting

Jeremy Coles
Description
- This is the weekly DTEAM meeting. The intention is to run the meeting in EVO: http://evo.caltech.edu/evoGate/. Join the meeting in the GridPP Community area. The phone bridge number is +41 22 76 71400. The phone bridge ID is 124399 with code: 4880.
Minutes
    • 11:00 11:20
      Experiment problems/issues 20m
      Review of weekly issues by experiment/VO: LHCb, CMS, ATLAS, Other
    • 11:20 11:35
      ROC update 15m
      EGEE-WLCG ops
      *******************
      From last week:
      - Revisited questions about the availability calculation (e.g. would it be possible to implement mechanisms for automatic removal of periods in which sites failed due to monitoring-related problems?).
      - gLite 3.1 Update 16 was released to production today. The update contains:
        * A new index on the attribute GlueServiceEndpoint, used by lcg-utils
        * UI: bug fixes to jdl API (bulk submission) and gfal clients
        * dcache SE: Glue 1.3 clean-ups and bug fixes
        * DPM SE: version 1.6.7 (32-bit and 64-bit) fixing various configuration bugs; introducing new front-ends for Xroot and HTTP/HTTPS; upgrading the version of gSOAP from 2.6.2 -> 2.7.6b
        * GFAL version 1.10.8-1: creation of subdirectories with lcg-utils
        * lcgCE: bug fixing
      - From France: "A lesson learnt from CCRC08 is that some VOs don't mind the status published by a CE queue, so that they can wrongly submit on queue with a non-Production status."
      - Detailed report from CMS on: Data certification, T0 status and reprocessing; Re-processing (jobs take too long at FNAL due to a dCache issue); MC production (for details see http://khomich.web.cern.ch/khomich/csa07Signal.html); Data Transfers and Integrity, DDT-2/LT status (55/56 T[01]-T1 crosslinks; only ASGC->RAL is missing). Documentation associated with activities is improving, e.g. https://twiki.cern.ch/twiki/bin/view/CMS/DDTLinkExercising.
      From this week:
      - gLite 3.1.0 PPS Update 21 was released to PPS last Friday. No major issues found so far. The update contains:
        * New VOMS-Admin server (2.0.13-1) and client (2.0.6-1): added ACL support to command-line client; 9 bugs fixed. Find yours in https://savannah.cern.ch/patch/index.php?1629
        * New vdt_globus_essentials to fix Globus bug 5771: mainly of interest for CERN-PROD, fixing hanging processes on submission of SAM RB and WMS tests
        * New version of lcg-tags: warning messages suppressed
        * DPM 1.6.7-4 32 and 64 bit: SRM v2 and SRMv2.2 new (fixed) behaviour when creating subdirectories with srmMkdir
        * New glite-AMGA_oracle metapackage
        The release notes are here: https://twiki.cern.ch/twiki/bin/view/EGEE/PPSReleaseNotes_310_PPS_Update21
      - Another discussion on SAM test result accuracy and how this affects the availability calculation result for a site.
      - Italy reported that SAM was not functioning correctly 14th-16th March.
      - Russia reported a "Critical issue with unauthorized access to disk space via xrootd service. It does not depends on either DPM or dCache. Any person in the world who has an xrootd client can read and write everything. The single action which can not be done - delete files. This point completely violates "The Grid Traceability and Logging Policy" (https://edms.cern.ch/document/428037/). ... this bug is absolutely critical from security point of WLCG/EGEE infrastructure and xrootd service must be stoped until the bug will fixed. See More: https://twiki.cern.ch/twiki/bin/view/LCG/DpmXrootAccess".
      - FTS transfer-url-copy update for space tokens will be in gLite 3.0 update 41, due out shortly.
      - CMS (again a long report submitted) are starting the discussion about T2 analysis associations.

      ROC manager update
      *************************
      - An OLA between GGUS and TPMs has been proposed: http://edms.cern.ch/document/888089 . We need to comment!
      - There is now a mandate for the EGEEIII Operations Automation Team: https://edms.cern.ch/document/901705.
      - A GGUS site support survey is about to start.
      - Steve T. is proposing to use GlueSiteObject to increase the useful information published by sites. See http://goc.grid.sinica.edu.tw/gocwiki/How_to_publish_my_site_information

      Ticket status
      ***************
      See linked document.

      Other
      *******
      - TPM commitments and how we should organise shifts
      - Rolling COD activities out beyond the T1
      more information
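The availability question raised under EGEE-WLCG ops (automatic removal of periods in which a site failed because of monitoring-related problems) can be illustrated with a small sketch. This is not the SAM or WLCG availability algorithm; it assumes a simplified model with one pass/fail sample per hour, and the names `availability`, `samples` and `excluded_hours` are hypothetical.

```python
def availability(samples, excluded_hours=frozenset()):
    """Fraction of counted hours in which the site passed its tests.

    samples: dict mapping an hour index to True (passed) or False (failed).
    excluded_hours: hours to ignore, e.g. when the monitoring itself was broken
                    (an assumption of this sketch, not an agreed policy).
    Returns None if every hour was excluded.
    """
    counted = [ok for hour, ok in samples.items() if hour not in excluded_hours]
    if not counted:
        return None
    return sum(counted) / len(counted)


# Hypothetical day: the site is recorded as failing during hours 10-15.
samples = {hour: not (10 <= hour <= 15) for hour in range(24)}

# Hypothetical monitoring outage covering hours 10-13 of that failure.
monitoring_outage = {10, 11, 12, 13}

raw = availability(samples)                           # 18 good hours out of 24
corrected = availability(samples, monitoring_outage)  # 18 good hours out of 20
```

In this toy case the recorded availability rises from 18/24 to 18/20 once the four monitoring-affected hours are left out of the denominator, which is the kind of effect the discussion was about.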
    • 11:35 11:45
      GridPP20 - some points from the discussions 10m
      Some areas to follow up:
      -- Random users are using grid tape bandwidth.
      -- Tools to police users are not in place.
      -- Default WMS settings lead to 3-6 retries.
      -- If documentation were up to date there would be fewer questions.
      -- Some users are too patient (power users vs those with no experience), but this may change when physics results become pressing.
      -- Experiments are trying to set up support structures that are global but act locally.
      -- Would it help to run a meeting focussed on users? Joint sessions between infrastructure (service) providers and experiment users?
      -- Is there or could there be a "user coordinator"?
      -- Policy on inefficient jobs: is this now ready to fully implement and was there any feedback? (Policy link: http://www.gridpp.ac.uk/pmb/docs/GridPP-PMB-113-Inefficient_Jobs_v1.0.pdf. Needs to go in the wiki.)
      -- Is there anything else we can do on the misuse of resources issue? Sites are to respond to tickets explaining their position.
    • 11:45 11:55
      Deployment Board matters 10m
      - List of issues raised for follow up
        -- Follow up on UKQCD requirements (in particular network). Can they use the GridPP infrastructure?
        -- The policy on killing inefficient jobs was a draft until December. It now needs to become a formal policy. Has there been any site feedback?
        -- An area of concern is capturing error messages and explaining them for new users. Many examples are thought to be found in the Ganga hypernews area. Are there other useful sources that should be pooled into a reference tool/document?
      document
    • 11:55 12:05
      Actions review 10m
      http://www.gridpp.ac.uk/wiki/Deployment_Team_Action_items
    • 12:05 12:10
      AOB 5m