Oct 10 – 14, 2016
San Francisco Marriott Marquis
America/Los_Angeles timezone

Advances in Grid Computing for the FabrIc for Frontier Experiments Project at Fermilab

Oct 11, 2016, 11:45 AM
15m
GG C2 (San Francisco Marriott Marquis)

Oral, Track 3: Distributed Computing

Speaker

Dr Kenneth Richard Herner (Fermi National Accelerator Laboratory (US))

Description

The FabrIc for Frontier Experiments (FIFE) project is a major initiative within the Fermilab Scientific Computing Division charged with leading the computing model for Fermilab experiments. Work within the FIFE project creates close collaboration between experimenters and computing professionals to serve high-energy physics experiments of differing size, scope, and physics area. The FIFE project has worked to develop common tools for job submission, certificate management, software and reference data distribution through CVMFS repositories, robust data transfer, job monitoring, and databases for project tracking. Since the project's inception, the experiments under the FIFE umbrella have matured significantly and now present an increasingly complex list of requirements to service providers.

To meet these requirements, the FIFE project has been involved in transitioning the Fermilab General Purpose Grid cluster to support a partitionable slot model; expanding the resources available to experiments via the Open Science Grid; assisting with commissioning dedicated high-throughput computing resources for individual experiments; supporting the efforts of the HEP Cloud projects to provision a variety of back-end resources, including public clouds and high-performance computers; and developing rapid onboarding procedures for new experiments and collaborations. The larger demands also require enhanced job monitoring tools, which the project has developed using tools such as Elasticsearch and Grafana.

FIFE has also worked closely with the Fermilab Scientific Computing Division's Offline Production Operations Support Group (OPOS) to help experiments manage their large-scale production workflows. This group in turn requires a structured service to facilitate smooth management of experiment requests, which FIFE provides in the form of the Production Operations Management Service (POMS). POMS is designed to track and manage requests from the FIFE experiments to run particular workflows, and to support troubleshooting and triage in case of problems.

Recently, we have started work on a new certificate management infrastructure called Distributed Computing Access with Federated Identities (DCAFI) that will eliminate our dependence on a specific third-party Certificate Authority service and better accommodate FIFE collaborators without a Fermilab Kerberos account. DCAFI integrates the existing InCommon federated identity infrastructure, CILogon Basic CA, and a MyProxy service using a new general-purpose open-source tool.

We will discuss the general FIFE onboarding strategy, progress in expanding FIFE experiments' presence on the Open Science Grid, new tools for job monitoring, the POMS service, and the DCAFI project. We will also discuss lessons learned from collaborating with the OPOS effort and how they can be applied to improve the efficiency of current and future experiments' computational work.

Primary Keyword (Mandatory): Computing models
Secondary Keyword (Optional): Computing middleware
Tertiary Keyword (Optional): Distributed workload management

Primary authors

Anna Mazzacane (FNAL)
Dave Dykstra (Fermi National Accelerator Lab. (US))
Gabriele Garzoglio
Ms Jeny Teheran (Fermilab)
Joseph Boyd (Fermilab)
Dr Kenneth Richard Herner (Fermi National Accelerator Laboratory (US))
Kevin Retzke (Fermilab)
Marc Mengel (Fermilab)
Margaret Votava (Fermilab)
Michael Kirby (Fermi National Accelerator Laboratory)
Mine Altunay (Fermi National Accelerator Laboratory)
Neha Sharma (Fermilab)
Tanya Levshina
