NGS-GridPP joint meeting

Europe/London
EVO - GridPP Deployment team meeting

Jeremy Coles
Description
- The meeting will take place in Oxford. OeSC will provide the room - for details of how to get there and accommodation see http://www.physics.ox.ac.uk/it/GridPP-NGS%20Meeting.htm. GridPP will fund GridPP participants and NGS those from NGS! Let Jeremy know if you do not have clear funding to attend.
- This workshop is intended to clarify interoperability questions and issues between NGS and GridPP teams in order to map out steps towards an NGI.
NGS-GridPP meeting Wednesday 22nd October 2008
*************************
These meeting notes record questions raised and answers given during the meeting, plus any explicit actions agreed.

Present:
******
Steven Young
David Wallom
Stephen Burke
Yves Coppens
Pete Gronbech
Derek Ross
Robin Middleton
Alessandra Forti
Mike Jones
Andy Richards
Matt Viljoen
Claire Devereux
Thierry Delaitre
Shiv Kaushal
Kevin Haines
Jens Jensen
Jason Lander (am – Access Grid)
Andrew Elwell
Mingchao Ma
Keir Hawker
Jeremy Coles

Actions/follow up
***************
1) Share experiences when it makes sense (fabric). Circulate meeting details. [Jens]
2) Share SAM history plot link [Jeremy] The link is http://pprc.qmul.ac.uk/~lloyd/gridpp/samplots.html.
3) Resolve NGS double name publishing issue. It was thought that “ngs” could be bolted on the front of the existing names. [Steven/David]
4) Clarify RUS usage and plans on EGEE side. [Matt/Jeremy]
5) Understand how sites supporting GridPP/EGEE and NGS will get ticketed. They should really have one helpdesk system to use/learn! [Matt/Jeremy/Andy]
6) Follow up on email interface actions for GGUS response. Do all sysadmins need to login to set solved status flag? [Jeremy]
7) NGS ops to clarify incident response list and usage [Steven/Andy]
8) Mingchao to agree security dissemination and response process for NGS and especially sites that sit in both NGS and GridPP. [Mingchao]
9) Clarify what sites are signing up to with the EGEE SLD. [Jeremy]
10) List the exact differences in the NGS and GridPP approaches and document this within 5? weeks [Steven, Pete, David & Yves]
11) Update UKI-ROC technical mailing list to be used for internal ROC-GridPP and ROC-NGS communications. [Jeremy]

Some ways to progress:
A) Can we get jobs running on NGS
B) Include other GridPP sites that are not affiliated.
C) Get more NGS people involved.
D) Prepare document to map GridPP to NGS. Equivalent of accounting. Get one site in NGS core taking jobs from GridPP WMS.
Technical discussion possible at next NGS technical board.
E) Clarify commitments on joining EGEE
F) Charge at point of use concept needs further thought
G) NGS could approve some EGEE VOs? Tier-2 regional VOs?
H) Alternative – a lot of users using GT4/web services. Not difficult to install.

Questions raised during middleware talks onwards.

Middleware – GridPP/EGEE gLite
****************************
Are there APIs available for users who do not use command line? Yes, Ganga. WMS has web services. Majority of WMS are direct submission Grid-SAM.
Is CREAM that necessary? Replace with batch system. BES is not stable yet. It is a standard. CREAM is also BES compliant. Need to be careful to not end up with BES submitting to BES! Slight hurdle for affiliate sites – lack of SGM accounts.
How is the 64-bit thing coming along? Can still run 64-bit kernel with 32-bit compatibility libs. Many sites run 64-bit SEs for example.
Any work with virtualisation going on? Worked on but not turning into anything real yet. But many services run as virtual machines but that’s not what you’re getting at.
What about standards? Using ETICS package now for builds. Code is being made less platform dependent. There is LSB that has been around for 3 years. UMD? Universal middleware.
Will there still be multiple WMS between GridPP and NGS? Could use the same one. Depends mostly on the load.
Can ngs VO submit through your WMS? They could but at the moment only inca tests are being run anyway. GridPP clusters also report to NGS broker. NGS BDII aggregates published information.
How well does the CE deal with cluster being submitted to locally? The NGS nodes are all submitted to through multiple mechanisms. You can/should separate the batch system from the CE. Sites have different local user access. Some force entry through grid but others allow direct submission.
What will WMS do – talk to both CREAM and LCG-CEs? GIN workshop OGF talk on this topic.
Will talk to both and other compliant components (BES). Command line interface will not be with us forever – xml is machine readable. Glue itself can be represented as xml but we do not believe gLite will go that route.

STORAGE – SRM, SRB and iRODS (Jens)
***********************************
On SRM: How many user client instances of the SRM? You have 6 versions but still a standard? Yes. Different implementations of the standard.
On SRB: How often does the SRB get updated since versions don’t interoperate? Not so often now, every month or two at start. Also note that SRB is used by libraries – used by fedora. 25 instances using SRB in this repository environment.
SRM and SRB have interoperated. Can transfer files between the two. Lot in common at the fabric level.
(Action) Share experiences when it makes sense (fabric).
Middleware – one size fits all. Most of the instances have client tools.
Case study – how would a user use the SRM? Register file – copy to SE. Will find SE. To put file in it talks to SE/SRM. Client does data transfer. Performance good because transfer load is distributed. Transfer into any of the disk nodes. Finished. Client registers in catalogue.
Similar to SRB? Well, more modular. More happens behind scenes with SRM. Can also talk to SRM directly.
Once in SRM how do you access files? Job knows what is wanted. Decide send job to where data is located. Copy file out of SRM or access via local access protocol. Lot of data management.
How do you manage meta-data? This is not stored in the SRM. There are databases out there. gLite has metadata catalogue called AMGA but HEP have their own solutions. Many users not technically willing to get involved at this level. Lots of data but want tools to do it.
Are the user tools available? High level data management is missing. Biomed have friendly portals.

Glue and storage accounting (Stephen)
There is an EGEE storage accounting service – takes snapshots.
How do you/we charge for these things?
How many users have written clients to access storage directly?
Have not heard from GridPP about support for Windows. LHCb make use of Windows. There is work/interest for this in the linear collider groups. No plan for SRM on Windows.

NGS software stack (David)
***********************
Attach a lot of importance to gsissh access to node. Only recently got an RB. VDT tools – difficult for people to install on their desktop. Primary purpose is to allow those who need to precompile to login.
Binaries all available in one place in GridPP? Need some interactive based login – T3 like?
How does someone get binaries around grid? If production, SGM puts it around the grid. Ordinary job sent to use officially tagged software. Own software, in sandbox otherwise wget. That requires a homogeneous environment.
Submit a compile and run job example? Analysis code you might want to fiddle and resubmit. People want to build and test and then submit. MPI start has been designed to allow…
What you describe is not like a grid. Being a single homogeneous system is not a grid. You have a standard environment variable that tells you where the software is located. You cannot tailor jobs to 300 sites like in EGEE. There are 24 sites in NGS at the moment.
Applications that are designed to be interactive are difficult to “gridify”.

Monitoring & accounting (GridPP/EGEE) – (Andrew)
*********************************************
What is the scope of torque? How does this work with Condor? Had problems but working at the moment.
Is APEL still using R-GMA as the transport? Yes, but looking to move to ActiveMQ.
Are SAM trends captured? Yes, can login but also GridPP captures graphs such as can be reached through links on this page: http://pprc.qmul.ac.uk/~lloyd/gridpp/samplots.html.
Double publish. One VO doesn’t use same software as another. Still need to address the namespace issue. [to be decided in NGS]. Should propose an ngs name change across sites.
Action - Can bolt ngs on front of name.
Name change needs to be decided/agreed.
For accounting – what are the plans for standards? RUS? What progress?

NGS monitoring & accounting (Shiv)
******************************
Federated Ganglia used – also in GridPP? Used to do this but not now.
Overlap of tests between INCA and SAM?
Do site admins watch the monitoring & what plans if any to alarm tests at site (Nagios)?
Spec benchmarks?
If over quota user removed from VO -> gridmap file.
Are users able to see their accounting data?
What data does APEL upload, just the summary? No, full information.

Users (Alessandra)
****************
Were the VOs existing before the resources? Yes, follow structure of physics collaborations. Different for biomed.
What do you mean by smaller VOs cause more service disruption? Unintentional filling of /tmp space with log files would be an example.
If they don’t install software how do they work? They copy exe across. Send job.
Do VOs use commercially licenced software? Few – e.g. linear collider group.
Do the users target resources? Tend to run everywhere unless local university users.
But they still use the WMS? Yes.
What rpms are stored in YUM and apt-get repos? Mainly used by the nagios group at the moment. SVN for their code. Also dCache SRM users.
Is it mediated by gridsite? Yes.
Is google chat more than just jabber? Jabber. Group chat is useful too.
Do you add user tickets if they speak to you in the corridor? Yes if appropriate.
Do you have GOCDB entries for higher level services? Yes. Plug in nodes to the list. More should be automated. You also for example set the information system for CE – set to draining and the WMS will stop job submissions. Slight problem that direct user submissions bypass the WMS!
NGS recently adopted policy – member of VO must authenticate using VOMS credential. Do they use VOMS proxy init? Globus gridmap file still used so may still map you. Made policy since no way to allow accounting.
Multiple user account mapping.
With pool accounts… if ATLAS and something else? Gets mapped to linear collider if using VOMS credential of that VO.
Fall back is to use unix group if no VOMS credential? Yes… those with multiple mappings learn what to do. WMS will require VOMS soon but not all services use VOMS.

Support-GGUS (Stephen)
*********************
Who decides if it is a VO or infrastructure problem? User can direct or TPM.
Can tickets be assigned to more than one unit? No, but you can split a ticket.
For security ticket – don’t use GGUS to handle tickets. Incidents done off list.
GGUS now better than 2-3 years ago. Then moved away from interface – didn’t handle attachments etc. Footprints – forked because GGUS end kept moving so held local copy of the schema. Two closed states. Could fix the importer if people wanted it, but can continue using GGUS.
GridPP sites joined NGS now come back to using Footprints – force them to have agents. From Oxford side, since no ngs jobs only seen one side of it. NGS footprints service far from perfect.
(Action/Issue): Understand how sites supporting GridPP/EGEE and NGS get ticketed. Should really have one helpdesk system to learn! EGI/NGI world.
NGS able to grab reports out of footprints, does GGUS provide anything? Basic reports but more functionality has been requested.
Some daily problems seen: One ticket – assigned to more than one site but should have been split. 2 tickets a day is at the UK level through GGUS.
On question of reporting – do you have breakdown of what the tickets are about? GGUS has a breakdown by support unit assignments but not finer grained.
Watching tickets – has it gone to in progress? Reply to email if does not go to in progress. Sysadmins as supporters. Can look with cert but need to register to change.
Action – follow up on email interface actions for GGUS response. Do all sysadmins need to login to set solved status flag? Waiting for reply seems to now be automated.
Applications activity in EGEE – NA4 (Abdeslem)
How many people? Three half FTEs.

Security (Mingchao)
****************
Need mutual communication. GOCDB – populates some entries. Dedicated mailing list used by Mingchao!? NGS has two mailing lists – incident and discussion. Does this need a tidy up? No dedicated security officer – new person needs to understand clearly.
What if a GridPP specific issue?
Action: NGS ops to clarify incident response list and usage
Action: Mingchao to agree dissemination process.
What about sites which are GridPP and NGS affiliates? How do they respond, to both?
NGS policy based on OSG policies. Also the contact database will be moved to be the GOCDB names.

NGS security (Jens)
****************
Do you include research and commercials? [Scope shibb] Could let anyone in with IDP. SARoNGS to work as anonymous user. Pressure on federation about release of attributes. Can’t even find out internally who people are.
Is it the CA who makes the decision not the TAG? Yes.
Most incidents at the OS level therefore common in principle to GridPP and NGS.
Will shib affect delegation? It is authentication.

Miscellaneous topics (various)
*************************
Databases (Keir)
What interface are you presenting? Users SGL+, java, gsissh, applications, ODBC. Also OGSA-DAI.
Are GridPP databases used by end users or something else? Mainly VO wide application based. GOCDB is also automated; INCA connects to GOCDB.
Specific ports? Yes…. New server has ssl, application express allows a web-page connect.
What about the conditions database? Users… DBAs have to allow. NGS counts users in different way. NGS may have users connecting from anywhere. Before Oracle 11 could connect with browser but not afterwards.

COD (Derek)
Another question ROC assigned to UKI ROC dealing with tickets.

gLite on NGS? (Yves)
Do you need SL4 on workers? Yes at the moment. Problem in BHAM since non-GridPP facility.
How have you handled cluster integrators?
If replace versions on nodes then breach maintenance contracts. People would have to be aware of this… other sites may not have this in their procurement. This is problem of gLite. Should support several.
If NGS site did this… interested in joining but what are we signing up to? SLD question.
Action: Jeremy to clarify what sites are signing up to with the EGEE SLD.
Maui adjustment for fairshare. Worried opening access like this would allow many many jobs to arrive.
What effect does using this separate submit box have? Need adequate scheduling in Maui – need to be careful. Increased load on sysadmins would be a concern. (CE MON box).
After 31st March… no core sites. Sites would want to recover costs from VOs. Overlapping of pool accounts. Don’t want to interfere.
Why can’t we deploy this sub-stack… it is only part of the stack! Why not existing RB submitting to NGS? Globus, glue etc. all there… you propose another CE.
What does MON do? Mainly accounting. Rather than pull accounting out of standard RUS service… rather than whole new host. Interface with APEL.
Glue has to be BDII… worry is that this is technically true, software never tested submitting to standard Globus. Bits we don’t do… all the VO support – and the application stuff as in VO software areas. LCG pbs allocates specific to queue.
Need to explicitly list the differences.
Action: Stephen – Pete – David and Yves, to list the exact differences in the NGS and GridPP approach and document this within X weeks.
Good to have cut down CE but do we really need it? Just run VO software bits and MON perhaps way forward. Rsync batch system logs, just use MON. Wrapper… Sounds more and more like aligned internally. RUS already reads from gatekeeper and batch system. Could quite easily have this publishing in format needed.
Ignoring gLite WN software installed? No that’s easy. Running cluster vision nodes on SL4. The jobs need … nodes connected to internet connection. Only need outgoing access.
BHAM uses iptables.
Are the backend nodes expecting to use SRM?
What do you mean process accounting? Why, batch system caches usage… it is security….

Some ways to progress:
1) Can we get jobs running on NGS
2) Include other GridPP sites that are not affiliated.
3) Get ngs people involved.
4) Document to map GridPP to NGS. Equivalent of accounting. Get one site in NGS core taking from GridPP WMS. Technical discussion possible at next NGS board.
5) Clarify commitments
6) Charge at point of use concept.
7) NGS approving EGEE VOs? Tier-2 regional VOs?
8) Alternative – a lot of users using GT4/web services. Not difficult to install.

How do we stay in contact? Mailing list!!! Not very welcome. Perhaps informal names list in follow up to meeting mail.
Point of friction has been ticket handling – can this be improved?
Action – revisit UKI-technical mailing list for joint ops.
On the issue of an NGI… GridPP way is very specific. Also need to factor in the EGEE part… some expect “changes” come November? Two get used interchangeably.

Missed topics (for future meeting?):
- Need to think about what user communities need and how they will be encouraged/helped.
- Training. On the user side
- Regionalisation of the GOCDB.
- MPI. How is it being used.

Meeting ended at 17:40.
    • 09:00 09:30
      ARRIVAL (Tea, coffee and pastries) 30m
    • 09:30 09:40
      Introduction 10m
      - Overview of the day - Objectives of the meeting
      What's it all about?
    • 09:40 10:00
      Current GridPP deployment and operations [Jeremy Coles] 20m
      - Organisational context - Drivers and concerns
      GridPP-EGEE ops slides
    • 10:00 10:20
      Current NGS deployment and operations [Andy Richards] 20m
      - Organisational context - Core aims and drivers
      Slides
    • 10:20 10:40
      Middleware: GridPP/EGEE gLite [Pete Gronbech] 20m
      - Overview of components - gLite interfaces - Submission mechanism - Information system and Glue 2 [Stephen Burke]
    • 10:40 11:00
      SRM, SRB and iRODS [Jens Jensen] 20m
      - Summary of the storage approaches - Common areas - Underlying requirements - Interoperability
      Slides
    • 11:00 11:20
      BREAK (Tea/Coffee) 20m
    • 11:20 11:40
      Middleware: NGS software stack [David Wallom] 20m
      Slides
    • 11:40 11:55
      Discussion 15m
      - Overlaps - Clarification on service requirements
    • 11:55 12:10
      Monitoring & accounting: GridPP/EGEE [Andrew Elwell] 15m
      - SAM framework - Site reports and availability - Plans for Nagios and alarms - APEL accounting - Storage accounting
      Slides
    • 12:10 12:25
      Monitoring & Accounting: NGS [Shiv Kaushal] 15m
      - Overview of monitoring on NGS - Inca monitoring framework - User Accounting System - Resource Usage Service - Future Plans
      Slides
    • 12:25 12:40
      Discussion 15m
      - Areas of overlap - Adoption of technologies
    • 12:40 13:30
      LUNCH (Buffet) 50m
    • 13:30 13:50
      Users & support: WLCG/EGEE [Alessandra Forti] 20m
      - User communities - Documentation (use of wikis, blogs, lists...)
      Slides
    • 13:50 14:10
      Users & support: NGS [Matt Viljoen] 20m
      - Types of users - Special requirements - Documentation (including wikis, blogs, lists...)
      Slides
    • 14:10 14:30
      Helpdesk [Stephen Burke] 20m
      - Use of GGUS - Function of TPMs - Needs and expectations (ROC assignments) - Ticket followup (who does what and when)
      Slides
    • 14:30 14:50
      Discussion 20m
      - Regional participation, applications and the NGI [Abdeslem Djaoui] - Overlapping communities - Changing user needs and services - Engaging new user groups - Outreach and support
      Slides
    • 14:50 15:10
      BREAK (Tea/Coffee) 20m
    • 15:10 15:30
      Security: ROC & GridPP [Mingchao Ma] 20m
      - Incident response - Communications
      Slides
    • 15:30 15:50
      Security - NGS/NGI [Jens Jensen] 20m
      - CA TAG - Future authentication methods - Incident response [??]
      Slides
    • 15:50 16:20
      Miscellaneous topics 30m
      - MPI - Shared resource centres (and campus grids?) [Yves Coppens] - COD activities and move to regional COD [Derek Ross] - Training - Databases [Keir Hawker] - Sharing topology information - regional GOCDB
      Databases
      gLite and the NGS
      Slides
    • 16:20 16:40
      What next? 20m
      - review of actions & discussion outcomes - planning?