Summary of GDB meeting, January 15, 2014 (CERN)

Agenda

https://indico.cern.ch/conferenceDisplay.py?confId=272795

Welcome - M. Jouvin

See slides for future (pre-)GDB planning

  • CNAF Bologna will host the March GDB and probably a pre-GDB on batch systems

Started to think about the next WLCG workshop

  • Early July for next WLCG workshop?
  • Volunteers to host it? Please send an email to Michel and Simone

February pre-GDB will be Ops Coord F2F meeting

VO-based SAM tests are difficult to prioritise. Timeouts are not really errors and should not be counted as failures of the site.

  • SAM tests are not planned to use pilot jobs.
  • Proposal to run critical tests with the VO's lcgadmin role so sites can prioritise them.
  • Other tests run with normal role but not counted as critical.

Discussion about problem of prioritizing SAM tests:

  • SAM tests: the current proposal is not seen as a real solution to the scheduling issue
    • At least 2 major tests require a normal/production role: glexec and WN (write access to data)
  • Would be easier if SAM tests could run in the pilot framework: to be rediscussed with experts in more detail
  • I. Bird: we don't care how tests are submitted, no need to use the SAM machinery; as long as experiments publish results into the SAM framework, that is OK.
  • Need to schedule a dedicated discussion slot at a future GDB with appropriate experts (February?)
    • VOs should send their feedback on the proposal and the discussion today

Accounting Update

CPU Accounting - S. Pullinger

EMI-3 client new features

  • HS06 support
  • Richer data collected, in particular number of nodes/cores
  • The portal will support the new EMI-3 features later this year (e.g. summary by submit host)

EMI-2 client: 30 April 2014 is end of security updates

Upgrade instructions

  • Complete rewrite of the client: upgrade from EMI-2 requires more than just package upgrades
  • Upgrade (switch) must be done on a month boundary: no ability to consolidate data collected/published by both versions for a site

Several other clients using APEL SSM2 library: SGAS (NorduGrid), ARC/JURA, QCG, EDGI Desktop Grid

Enforced the data retention policy: user DN only kept 18 months

After end of EGI, only bug fixes are expected for the CPU accounting

  • Development efforts only for storage and cloud accounting

Storage Accounting - J. Gordon

StAR: publishers in EMI-3 release of DPM (1.8.7) and dCache (2.5.2 and 2.6+)

  • Not enabled by default
  • Italy builds StAR records from BDII information, allowing data to be collected from storage systems without a native publisher (Xrootd, CASTOR, StoRM): a little less accurate but much better than nothing (see the sketch after this list)
    • Could be extended to other sites that need it
  • All using SSM to publish
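
A minimal sketch of what such a BDII-based approach could look like (the BDII endpoint, the GLUE 1.3 attribute selection and the StAR-like field names are illustrative assumptions, not the actual INFN implementation):

    import ldap  # python-ldap

    # Hypothetical top-level BDII endpoint; query the GLUE 1.3 storage areas
    conn = ldap.initialize("ldap://lcg-bdii.example.org:2170")
    entries = conn.search_s("o=grid", ldap.SCOPE_SUBTREE, "(objectClass=GlueSA)",
                            ["GlueSALocalID", "GlueSAUsedOnlineSize", "GlueSATotalOnlineSize"])

    def first(attrs, key, default="0"):
        # python-ldap returns lists of strings/bytes; take the first value
        val = attrs.get(key, [default])[0]
        return val.decode() if isinstance(val, bytes) else val

    for dn, attrs in entries:
        # GLUE 1.3 publishes online sizes in GB; convert to bytes for a StAR-like record
        record = {
            "StorageShare": first(attrs, "GlueSALocalID", "unknown"),
            "ResourceCapacityUsed": int(first(attrs, "GlueSAUsedOnlineSize")) * 10**9,
            "ResourceCapacityAllocated": int(first(attrs, "GlueSATotalOnlineSize")) * 10**9,
        }
        print(record)  # the real workflow would serialise this to StAR XML and publish it via SSM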

Since November a few sites have been approached to activate publishing of StAR records: a few bugs ironed out

  • Some flaws identified in the logic but this is not a showstopper: working out with developers what can be done

The portal has implemented a view similar to the CPU one, with various selection options

  • Several metrics: allocated resources, used resources...
  • Several views

Future: target the T1s to produce a storage report that could be shown to C-RRB

Simone: why not use the Italian approach for every storage system?

  • DPM and dCache have good-quality numbers in the BDII

Philippe: the portal is missing an easy way to produce, for a given VO, a view of site contributions as a function of time

  • John: should be possible through custom view, discuss offline

Clouds - J. Gordon

Done in the framework of EGI Federated Cloud TF.

  • Cuts a cloud Usage Record from the VMM database
  • SSM used to send to APEL
  • Already supports several cloud MW

Prototype portal available

  • Multiple views possible
  • Aiming for production next Spring

Normalization/Benchmarking: required for comparison between sites...

  • Machine/job features should help make the VM CPU power visible to accounting (known by the VMM, visible from inside the VM); a minimal sketch follows this list
  • Machine features may include a benchmark component: no need for a very precise measurement, any benchmark is better than nothing...
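
A minimal sketch of how a payload inside a VM could pick up the advertised CPU power (the file names follow the machine/job features proposal; the fallback path is an assumption):

    import os

    def read_feature(name, default=None):
        # $MACHINEFEATURES points to a directory of one-value-per-file features
        base = os.environ.get("MACHINEFEATURES", "/etc/machinefeatures")
        try:
            with open(os.path.join(base, name)) as f:
                return f.read().strip()
        except IOError:
            return default

    hs06 = read_feature("hs06")       # HS06 rating of the whole machine/VM
    slots = read_feature("jobslots")  # number of job slots sharing it
    if hs06 and slots:
        print("Approximate HS06 per job slot: %.2f" % (float(hs06) / int(slots)))
    else:
        print("Machine features not published; accounting falls back to unnormalised CPU time")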

Merging grid and cloud data: should not be a major problem

  • Similar to merging grid and local jobs

Will look at possibility of incorporating data from commercial providers to give a global view of a VO

  • Idea is to develop a bill parser

Alternative to infrastructure accounting (e.g. accounting data from experiment frameworks about user payload rather than pilot jobs): not much progress

GOCDB - J. Gordon

v5.2: a new feature allows arbitrary (custom) key/value pairs to be defined for sites and used to select a set of sites

  • Implement site classes? (a hypothetical query sketch follows)
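
A hypothetical example of selecting sites by such a custom property through the GOCDB programmatic interface (the property name WLCG_TIER and the extensions filter syntax are assumptions; the exact PI syntax should be checked against the GOCDB documentation):

    import requests

    # Query the GOCDB PI for sites carrying a (hypothetical) custom property WLCG_TIER=1,
    # authenticating with the usual grid certificate/key pair
    resp = requests.get(
        "https://goc.egi.eu/gocdbpi/private/",
        params={"method": "get_site", "extensions": "(WLCG_TIER=1)"},
        cert=("/path/to/usercert.pem", "/path/to/userkey.pem"),
        verify="/etc/grid-security/certificates")
    print(resp.text)  # XML listing of the matching sites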

Worth a presentation at a future GDB

  • First a presentation at the WLCG IS TF
  • Include OSG in the discussion

Volunteer Computing at CERN

LHC@home - N. Hoimyr

Volunteer computing can be used to get access to opportunistic resources (e.g. idle desktops) but also to dedicated clusters without the need to install a complex grid stack

  • Very different availability profiles for the different resources, depending on who provides them
  • Can bring in resources from outside HEP
  • Several volunteer computing MW options: BOINC, XtremWeb, Condor; in every case it is very easy for the contributor to install the agent required to participate

LHC@home: based on BOINC, 2 main projects

  • SixTrack accelerator physics: since 2004, 118k volunteers since the beginning, 20k recently
  • Test4Theory: CernVM+VirtualBox, 20k volunteers
    • Requires installation of VirtualBox in addition to the BOINC client

Recent work: infrastructure consolidation (server failover capabilities), preparing for new apps using VirtualBox

Importance of outreach, user credits and application marketing to get contributions from outside HEP communities

  • Less important for desktop grids using community member desktops

Some sites are starting to implement a policy where an idle desktop not participating in BOINC is automatically shut down

What about Protected Application Environment and BOINC/VirtualBox?

  • Means it cannot run as a service and always needs the current user's cooperation
  • The latest versions of BOINC and VirtualBox fix this

LHCb Volunteer Computing Experience - F. Stagni

LHCb started to run MC jobs, using the VirtualBox approach

  • Reusing the cloud work
  • Mainly using the community's desktops and farms: no attempt yet to get volunteers from outside the community
  • Reused the Test4Theory approach
  • No change required to DIRAC

A few issues encountered

  • CVMFS: LHCb relies on it; the local cache has to be populated, sometimes only for one job
  • Job duration: volunteers can switch off their desktops, so short jobs are needed for better efficiency
  • Accounting: LHCb accounting is organised by site; how to account for these resources?
  • BOINC Linux client not completely stable

Conclusion

  • An easy way to run MC jobs everywhere
  • Not completely trivial: need somebody to look at it

Discussion

Simone: why not add a BOINC client on the WN to backfill it when a multi-core job is not using all the cores?

  • WN idle cycles will be used through the volunteer computing infrastructure
  • Would be interesting to play with this idea and discuss it with the multicore deployment TF in Ops Coord

Worth more discussion in the future: a dedicated meeting (pre-GDB)

Actions in Progress

Ops Coord Report - J. Flix

Next meetings

  • No meeting tomorrow: virtual meeting, just add information to the summary
  • January 30th not yet confirmed

CVMFS

  • Deadline for 2.1.15 set to March 1st
    • Required before the Stratum-0 can be upgraded
  • Experiments will add CVMFS tests as part of their critical profile soon

SL6: still a few sites (<5%) not yet upgraded

  • Tickets open, please take care of them

glexec

  • Still 27 sites not ready
  • CMS and LHCb ready to use it
    • ALICE and ATLAS require some development
  • The test will be made critical this month

perfSONAR

  • Deadline for 3.3.1 is April 1st
  • The dashboard is no longer developed/maintained: need to find a new volunteer, as it is a critical piece of the perfSONAR infrastructure

SHA-2

  • EOS SRM issue for LHCb instance: fix should be deployed soon
  • Several CAs have started to issue SHA-2 certs by default
  • VOMS-Admin instability fixed: moved back to Java 6
    • Java 7 issue: waiting for a fix, not expected soon
    • A new VOMS-Admin cluster should be available soon

IPv6 validation

  • Successful validation of gLite WMS submission to a CREAM CE
    • A problem identified in HTCondor (to be fixed soon) prevents WMS submission to Condor from working

WMS decommissioning progressing

  • Followed up directly with the VOs still using it

Tracking tools: slow progress in the migration from Savannah to JIRA

  • Not yet a hard deadline at CERN but will happen!

Machine/job features available as an RPM for bare metal and OpenStack

  • Deployed at CERN on WNs
  • Next steps: make it available for OpenStack cloud environment, deliver it as a MW package in AA
  • Waiting for feedback from experiments

MW readiness

  • 1st meeting last December (12th): not enough participation from sites and experiments to really start work
  • Next meeting planned for Feb. 6: more volunteers needed

Multicore deployment

  • Twiki area created
  • Led by A. Forti and A. Perez-Calero
  • Focused on grid but Cloud WG interaction expected/desirable
  • Several activities occurring at ATLAS and CMS
  • A Doodle poll for the first meeting is running: please fill it in if you are interested!

OpenSSL and Java 7 Issues - M. Litmaath

First issue seen by dCache/DESY on Nov. 18, due to a new restriction in java.security requiring RSA key sizes > 512 bits

  • The WMS still delegates 512-bit proxies: bug report accepted
  • The workaround is to relax the key-size restriction in the java.security configuration file

Second issue seen Dec. 9 with openssl 1.0.1 (last working version 1.0.0-27)

  • Took a long time to understand the real issue... (Jan. 9)
  • Problem happens when both endpoints have been upgraded (TLS 1.2): they refuse to use 512-bit proxies
    • Hardcoded, no workaround

In fact all these 512-bit proxies come from gridsite, which is used by many other components (a proxy key-size check sketch follows the list below)

  • New gridsite version fixing the problem available in EMI-2 and EMI-3 since mid-December
    • Now using 1024-bit proxies
  • Node types requiring the fix
    • WMS (SL5/6)
    • FTS3: CERN pilot instance updated last Friday
    • CREAM: at least EMI-2 on SL6 with the delegation workaround implemented by some sites several months ago
    • Condor: situation unclear currently (not easy to fully assess)
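
A small sketch of how one could check whether a delegated proxy still carries a 512-bit key, using pyOpenSSL (the proxy location follows the usual convention but may differ per service):

    import os
    from OpenSSL import crypto

    proxy_path = os.environ.get("X509_USER_PROXY", "/tmp/x509up_u%d" % os.getuid())
    with open(proxy_path) as f:
        cert = crypto.load_certificate(crypto.FILETYPE_PEM, f.read())

    bits = cert.get_pubkey().bits()
    print("%s: %d-bit key -> %s" % (
        proxy_path, bits,
        "OK" if bits >= 1024 else "will be rejected by openssl 1.0.1 / TLS 1.2 endpoints"))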

The problem also affected Globus Toolkit 5.2

  • OSG did an emergency fix
  • An EPEL RPM fixing the problem was available as early as Dec. 20

Need to stress to sites the importance of upgrading the affected services so that the problem does not keep recurring

  • More and more sites upgrading to recent versions of openssl

Would be great to start a validation that MW components are ready to use 2048-bit proxies

  • Risk of breaking some components: this is why we stay with 1024-bit proxy for the time being
  • Not easy to do: may be quite time consuming
  • Take advantage of IPv6 testbed?

EGI Operation News - P. Solagna

SHA-2: very few sites left with non-compliant service

  • EGI following up with these sites: thanks to COD and NGIs
  • dCache SRM client available, hope to release it soon in UMD: difficult to check

Availability/reliability thresholds: discussing the possibility to raise the availability and reliability thresholds by 10%

  • Average availability threshold increased to 80% (currently 70%)
  • Average reliability threshold increased to 85% (currently 75%)
    • Reliability: allowed unscheduled downtime drops from roughly 7 to 5 days per month ((1 - threshold) x ~30 days)
  • Need to fail three months in a row to risk suspension
  • Based on current figures, 15-20 sites will be below target each month
    • Not necessarily the same every month
  • Approval expected at next OMB, end of January

Campaign to start soon for verifying security contacts declared in GOCDB

  • Bi-annual tests
  • Site contacts will have to reply to the test email they receive: no reply will be considered a critical failure, followed up by the NGI
  • If it works well, it could be extended to other contacts

Towards a New HEP CPU Benchmark - M. Alef

HS06 has been designed for pledges, procurements and accounting

  • Based on industry standard: SPEC
  • A set of benchmarks: subset of SPEC CPU2006
    • All C++ packages: 3 integer, 4 floating point
    • Proved to match HEP apps performance

HW improvements make it necessary to redesign the benchmark

  • New SPEC14 expected at the end of the year
  • Looks like a good candidate for HS14

HS14 requirements

  • Proven to match HEP apps performance
  • Ease of use: free for academic use, easy to use (known) for vendors
  • Timescale
    • Preliminary kits available to SPEC OSG (Open Systems Group) members: KIT is an OSG associate and will have early access to them
    • Final benchmark: end of 2014
    • HEP WG in 2014/2015

Proposed steps

  • Identify VO representatives ready to participate in the effort: should we include non-WLCG VOs like Belle II?
  • Agree on benchmark environment
    • HW platform: 64-bit (was already the case in HS06)
    • Compiler: HS06 used the default gcc compiler
    • Compiler flags: at least switch to 64-bit apps, possibly to a higher optimisation level (currently -O2)
  • Selection of representative HEP apps to compare with benchmark scores
    • Must be CPU bound: benchmark is not measuring I/O

Andrew: HS benchmark should be based on open source SW, like GEANT4

  • Basically almost everybody else finds it more important to use an industry standard

I. Bird: extension to non-WLCG VOs is more than welcome but should probably remain within HEP or very closely related sciences

  • Belle II
  • Astroparticle experiments like Auger

Agreed next steps

  • Experiment representatives already identified
  • Identification of representative applications: can start now, without access to the new SPEC14 suite
  • Running of representative apps on existing infrastructures
  • Agreement on new set of compilation flags to use: must match representative apps requirements

EGI Plans for Future Activities - P. Solagna

Cloud developments: Fed Cloud will roll into production in May 2014

Business development activities

  • Exploring pay-per-use model
  • Ability to provision resources from commercial providers
    • Linked with HELIX NEBULA activities

Data: data curation and preservation being evaluated

  • HEP collaboration would be appreciated

Distributed Competency Center

  • Help with outreach and engaging with new communities
  • Will require participation from experts from NGIs and VOs

AAI based on federated IdPs

  • Another area for collaboration with WLCG
  • A pilot should be started soon with one or more NRENs

H2020 plans: not finalized yet, most topics potentially relevant. Two main topics on the list:

  • Cloud-related project led by Fed Cloud TF
  • Federated AAI

HTTP data access and the DAVIX toolkit - A. Devresse

Motivation: HEP data access has different requirements from classical (web) data access

  • Scalability through redirectors
  • Performance: session reuse, low latency
  • Bulk, partial and vector operations
  • File and data management through DAV
    • 3rd-party copy, checksums, ACLs, staging, quotas
  • Security: support for PKI/X509 auth
  • Reliability: multiple replicas with automatic failovers

Many existing clients, but none covering all the required features

  • curl is the most complete

DAVIX goal is to provide a framework delivering/encapsulating all these features

  • Need to be easy to use
  • Independent from grid MW
  • Supported on all platforms
  • Based on an existing http I/O library: libneon
  • An API + a C++ lib with high-level I/O API
  • A set of command line tools for basic operations

DAVIX Status: current version is 0.2.8

  • released on EPEL and Debian
  • API/ABI stable
  • Already used by Dynamic Federations, GFAL2, FTS3 (see the gfal2 sketch after this list)
  • 0.3.0 in progress: SOCKS5 support, metalink transparent failover
  • 0.4.0 plans: async I/O, multi-stream metalink, zero-copy architecture
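
As an illustration of how users typically reach the library, a small gfal2-python sketch (the URL is a placeholder; it assumes the gfal2 bindings behave as documented, with HTTP/DAV URLs handled by the davix plugin underneath):

    import gfal2

    ctx = gfal2.creat_context()
    url = "https://storage.example.org/dpm/example.org/home/vo/file.dat"  # placeholder URL
    print(ctx.stat(url).st_size)  # metadata via WebDAV
    f = ctx.open(url, "r")
    data = f.read(1024)           # read the first kB over HTTP
    print("read %d bytes" % len(data))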

Integration with ROOT underway: TDavixFile (a usage sketch follows the list below)

  • Will streamline http support in ROOT
  • Currently available as a patch for ROOT 5.34 and ROOT 6
  • EPEL6 package root-net-davix
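
A minimal usage sketch in PyROOT (the file URL is a placeholder and it is assumed that the davix plugin is configured to handle http(s) URLs in the local ROOT build):

    import ROOT

    # With the davix patch/plugin, http(s) URLs are served by TDavixFile
    f = ROOT.TFile.Open("https://storage.example.org/path/to/file.root")  # placeholder URL
    if f and not f.IsZombie():
        f.ls()    # list the keys in the remote file
        f.Close()
    else:
        print("Could not open the remote file over HTTP")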

Cloud Pre-GDB Summary - M. Jouvin

See slides.

Discussion about conclusions in the slides:

  • Are we staying with CPU time or moving to wall-clock time?
    • No final decision, but currently staying with CPU time as the basis for pledges and reporting; in fact we are collecting both...
    • CPU time + reasonable overcommitment seen as the most efficient approach: overcommitment allows compensating for potentially inefficient use of VMs by VOs
    • Need to get more experience
  • Batch queues needed for VOs that don't run their own queue of tasks
    • No disagreement; we are not saying that everybody will get rid of their batch system
    • The WG is here to explore issues and find possible solutions if we want to run a cloud without a batch system in front of it. Several sites have already demonstrated that a cloud with a batch system in front of it (typically HTCondor) is transparent for VOs.

Next pre-GDB on this topic probably in spring.

-- MichelJouvin - 15 Jan 2014
