Summary of GDB meeting, January 15, 2014 (CERN)
Agenda
https://indico.cern.ch/conferenceDisplay.py?confId=272795
Welcome - M. Jouvin
See slides for future (pre-)GDB planning
- CNAF Bologna will host the March GDB and probably a pre-GDB on batch systems
Started to think about the next WLCG workshop
- Early July for next WLCG workshop?
- Volunteers to host it? Please send an email to Michel and Simone
February pre-GDB will be an Ops Coord F2F meeting
VO-based SAM tests are difficult to prioritise. Timeouts are not really errors and should not be counted as failures of the site.
- SAM tests are not planned to use pilot jobs.
- Proposal to run critical tests with VO's lcgadmin role so sites can prioritise them.
- Other tests run with normal role but not counted as critical.
Discussion about the problem of prioritizing SAM tests:
- SAM tests: current proposal not seen as a real solution to scheduling issue
- At least 2 major tests require a normal/production role: glexec and WN (write access to data)
- Would be easier if SAM tests could run in the pilot framework: to be rediscussed with experts in more detail
- I. Bird: we don't care how tests are submitted; no need to use the SAM machinery, as long as experiments publish results into the SAM framework.
- Need to schedule a dedicated discussion slot at a future GDB with appropriate experts (February?)
- VOs should send their feedback on the proposal and the discussion today
Accounting Update
CPU Accounting - S. Pullinger
EMI-3 client new features
- HS06 support
- Richer data collected, in particular the number of nodes/cores (see the record sketch below)
- Portal will support the new EMI-3 features later this year (e.g. summaries by submit host)
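For illustration, an abridged sketch of an EMI-3 APEL individual job record showing the new fields (header and field names as commonly used by APEL, but treat them as approximate; values invented, comments added for the reader):

    APEL-individual-job-message: v0.3
    Site: EXAMPLE-SITE
    SubmitHost: ce01.example.org/cream-pbs-long
    LocalJobId: 12345
    WallDuration: 3600            # seconds
    CpuDuration: 3500             # seconds
    Processors: 8                 # cores used: new in EMI-3
    NodeCount: 1                  # nodes used: new in EMI-3
    ServiceLevelType: HEPSPEC     # HS06 support: new in EMI-3
    ServiceLevel: 11.5            # HS06 per core (illustrative)
    %%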
EMI-2 client: 30 April 2014 is end of security updates
Upgrade instructions
- Complete rewrite of the client: upgrade from EMI-2 requires more than just package upgrades
- Upgrade (switch) must be done on a month boundary: no ability to consolidate data collected/published by both versions for a site
Several other clients use the APEL SSM2 library: SGAS (NorduGrid), ARC/JURA, QCG, EDGI Desktop Grid
Enforced the data retention policy: user DNs only kept for 18 months
After the end of EGI, only bug fixes are expected for CPU accounting
- Development efforts only for storage and cloud accounting
Storage Accounting - J. Gordon
StAR: publishers in EMI-3 release of DPM (1.8.7) and dCache (2.5.2 and 2.6+)
- Not enabled by default
- Italy cuts StAR records from BDII information to collect data from unsupported storage systems (Xrootd, CASTOR, StoRM): a little less accurate but much better than nothing
- Could be extended to other sites that need it
- All using SSM to publish
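For illustration, a heavily abridged StAR record sketch (element names follow the EMI StAR draft but should be treated as approximate; values invented):

    <sr:StorageUsageRecord xmlns:sr="http://eu-emi.eu/namespaces/2011/02/storagerecord">
      <sr:RecordIdentity sr:createTime="2014-01-15T12:00:00Z" sr:recordId="example-0042"/>
      <sr:StorageSystem>se01.example.org</sr:StorageSystem>
      <sr:Site>EXAMPLE-SITE</sr:Site>
      <sr:SubjectIdentity>
        <sr:Group>atlas</sr:Group>
      </sr:SubjectIdentity>
      <sr:StartTime>2014-01-14T12:00:00Z</sr:StartTime>
      <sr:EndTime>2014-01-15T12:00:00Z</sr:EndTime>
      <sr:ResourceCapacityUsed>13617068017</sr:ResourceCapacityUsed>
    </sr:StorageUsageRecord>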
Since November, a few sites have been approached to activate publishing of StAR records: a few bugs ironed out
- Some flaws identified in the logic, but not a showstopper: working out with the developers what can be done
Portal has implemented a similar view to CPU with various selection options
- Several metrics: allocated resources, used resources...
- Several views
Future: target the T1s to produce a storage report that could be shown to the C-RRB
Simone: why not use the Italian approach for every storage system?
- DPM and dCache have numbers of good quality in BDII
Philippe: the portal is missing an easy way to produce, for a given VO, a view of site contributions as a function of time
- John: should be possible through custom view, discuss offline
Clouds - J. Gordon
Done in the framework of EGI Federated Cloud TF.
- Cuts a cloud Usage Record from the VMM database
- SSM used to send to APEL
- Already supports several cloud MW
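For illustration, a sketch of the cloud usage record cut from the VMM database and sent over SSM (header version and field names approximate; values invented, comments added for the reader):

    APEL-cloud-message: v0.2
    VMUUID: vm-0001-example
    SiteName: EXAMPLE-SITE
    GlobalUserName: /DC=org/DC=example/CN=Some User
    Status: completed
    StartTime: 1389776400         # epoch seconds
    EndTime: 1389783600
    WallDuration: 7200
    CpuDuration: 6800
    CpuCount: 2
    Memory: 4096                  # MB
    Disk: 20                      # GB
    CloudType: OpenNebula
    %%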
Prototype portal available
- Multiple views possible
- Aiming for production next Spring
Normalization/Benchmarking: required for comparison between sites...
- Machine/job features should help to make VM CPU power visible to accounting (known by the VMM, visible from VM)
- Machine features may include a benchmark component: no need for very precise measurement, any benchmark is better than nothing...
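A minimal sketch of how the benchmark figure could reach accounting, assuming the HEPiX machine/job features layout ($MACHINEFEATURES pointing at a directory of one-value-per-file keys; file names approximate):

    # Hedged sketch: read the HS06 rating published by the site/VMM.
    import os

    mf = os.environ["MACHINEFEATURES"]  # set by the site on the WN or inside the VM

    def feature(name):
        with open(os.path.join(mf, name)) as f:
            return f.read().strip()

    total_hs06 = float(feature("hs06"))  # HS06 rating of the whole host/VM
    slots = int(feature("jobslots"))     # number of job slots provided
    print("approximate HS06 per slot:", total_hs06 / slots)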
Merging grid and cloud data: should not be a major problem
- Similar to merging grid and local jobs
Will look at the possibility of incorporating data from commercial providers to give a global view of a VO
- Idea is to develop a bill parser
Alternative to infrastructure accounting (e.g. accounting data from experiment frameworks about user payload rather than pilot jobs): not much progress
GOCDB - J. Gordon
v5.2: a new feature allowing arbitrary (custom) key/value pairs to be defined for sites and used to select a set of sites (see the query sketch below)
Worth a presentation at a future GDB
- First a presentation at the WLCG IS TF
- Include OSG in the discussion
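For illustration, a hypothetical PI query selecting sites by a custom property (key and value invented; the exact parameter syntax is in the GOCDB PI documentation):

    https://goc.egi.eu/gocdbpi/public/?method=get_site&extensions=(WLCG_TIER=Tier-1)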
Volunteer Computing at CERN
Volunteer computing can be used to get access to opportunistic resources (e.g. idle desktops) but also to dedicated clusters without the need to install a complex grid stack
- Very different availability profiles for the different resources, depending on who provides them
- Can bring in resources from outside HEP
- Several Volunteer Computing MW: BOINC, XtremWeb, Condor: in every case, very easy for the contributor to install the agent required to participate
LHC@home: based on BOINC, 2 main projects
- SixTrack (accelerator physics): since 2004, 118k volunteers in total, 20k recently
- Test4Theory: CernVM+VirtualBox, 20k volunteers
- Requires installation of VirtualBox in addition to the BOINC client
Recent work: infrastructure consolidation (server failover capabilities), preparing for new apps using VirtualBox
Importance of outreach, user credits and application marketing to get contribution from outside HEP communities
- Less important for desktop grids using community member desktops
Some sites are starting to implement a policy where an idle desktop not participating in BOINC is automatically shut down
What about Protected Application Environment and BOINC/VirtualBox?
- Means it cannot run as a service and always needs the current user's cooperation
- Latest version of BOINC and Virtualbox fix this
LHCb Volunteer Computing Experience - F. Stagni
LHCb started to run MC jobs, using the VirtualBox approach
- Reusing the cloud work
- Mainly using community desktops and farms: no attempt yet to recruit volunteers outside the community
- Reused the Test4Theory approach
- No change required to DIRAC
A few issues encountered
- CVMFS: LHCb relies on it; the local cache has to be populated, sometimes for only one job
- Job duration: volunteers can switch off their desktops, so short jobs are needed for better efficiency
- Accounting: LHCb accounting is based on sites; how to account for these resources?
- BOINC Linux client not completely stable
Conclusion
- An easy way to run MC jobs everywhere
- Not completely trivial: need somebody to look at it
Discussion
Simone: why not add a BOINC client on a WN to backfill it when a multi-core job is not using all the cores?
- WN idle cycles will be used through the volunteer computing infrastructure
- Would be interesting to play with this idea and discuss it with the multicore deployment TF in Ops Coord
Worth more discussions in the future: a dedicated meeting (pre-GDB)
Actions in Progress
Ops Coord Report - J. Flix
Next meetings
- No meeting tomorrow: virtual meeting, just add information to the summary
- January 30th not yet confirmed
CVMFS
- Deadline for 2.1.15 set to March 1st
- Requirement for upgrading Stratum-0
- Experiments will add CVMFS tests as part of their critical profile soon
SL6: still a few sites (<5%) not yet upgraded
- Tickets open, please take care of them
glexec
- Still 27 sites not ready
- CMS and LHCb ready to use it
- ALICE and ATLAS require some developments
- Test will be made critical this month
perfSONAR
- Deadline for 3.3.1 is April 1st
- Dashboard no longer developed/maintained: need to find a new volunteer for this critical piece of the perfSONAR infrastructure
SHA-2
- EOS SRM issue for LHCb instance: fix should be deployed soon
- Several CAs have started to issue SHA-2 certs by default
- VOMS-Admin instability fixed: moved back to Java 6
- Java 7 issue: waiting for a fix, not expected soon
- A new VOMS-Admin cluster should be available soon
IPv6 validation
- Successful validation of gLite WMS submission to a CREAM CE
- Problem identified in HTCondor preventing WMS submission to Condor from working; to be fixed soon
WMS decommissioning progressing
- Followed up directly with VOs still using it
Tracking tools: slow progress in the migration from Savannah to JIRA
- Not yet a hard deadline at CERN but will happen!
Machine/job features available as an RPM for bare metal and OpenStack
- Deployed at CERN on WNs
- Next steps: make it available for OpenStack cloud environment, deliver it as a MW package in AA
- Waiting for feedback from experiments
MW readiness
- 1st meeting last December (12th): not enough participation from sites and experiments to really start the work
- Next meeting planned Feb. 6: need more volunteers
Multicore deployment
- Twiki area created
- Led by A. Forti and A. Perez-Calero
- Focused on grid but Cloud WG interaction expected/desirable
- Several activities occurring at ATLAS and CMS
- Doodle for a first meeting is open: please fill it in if you are interested!
OpenSSL and Java 7 Issues - M. Litmaath
First issue seen by dCache/DESY on Nov. 18, due to a new java.security default requiring RSA key sizes larger than 512 bits
- WMS still delegates 512-bit proxies: bug accepted
- Workaround is to modify java.security (see the sketch below)
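A sketch of the workaround (the property name is the real java.security one; the relaxation shown is just one possible variant):

    # $JAVA_HOME/jre/lib/security/java.security
    # JDK 7u40+ default, which rejects the 512-bit WMS proxies:
    jdk.certpath.disabledAlgorithms=MD2, RSA keySize < 1024
    # relaxed variant letting 512-bit proxies through:
    jdk.certpath.disabledAlgorithms=MD2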
Second issue seen Dec. 9 with openssl 1.0.1 (last working version 1.0.0-27)
- Took a long time to understand the real issue... (Jan. 9)
- Problem happens when both endpoints have been upgraded (TLS 1.2): they refuse to use 512-bit proxies
In fact all these 512-bit proxies come from GridSite, which is used by many other components
- New gridsite version fixing the problem available in EMI-2 and EMI-3 since mid-December
- Node types requiring the fix:
- WMS (SL5/6)
- FTS3: CERN pilot instance updated last Friday
- CREAM: at least EMI-2 on SL6, with the delegation workaround implemented by some sites several months ago
- Condor: situation unclear currently (not easy to fully assess)
The problem also affected Globus Toolkit 5.2
- OSG did an emergency fix
- EPEL RPM fixing the problem available as early as Dec. 20
Need to stress to sites the importance of upgrading the affected services so that the problem does not persist
- More and more sites upgrading to recent versions of openssl
Would be great to start a validation that MW components are ready to use 2048-bit proxies
- Risk of breaking some components: this is why we stay with 1024-bit proxies for the time being
- Not easy to do: may be quite time consuming
- Take advantage of IPv6 testbed?
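A minimal sketch of such a validation, assuming standard VOMS client options (VO name illustrative):

    voms-proxy-init -voms atlas -bits 2048
    # then exercise the usual workflows (job submission, FTS transfers, SRM access)
    # against each component under validation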
EGI Operation News - P. Solagna
SHA-2: very few sites left with non-compliant service
- EGI following up with these sites: thanks to COD and NGIs
- dCache SRM client available, hope to release it soon in UMD: difficult to check
Availability/Reliability thresholds: discussing the possibility of raising the availability and reliability thresholds by 10%
- Availability average increased to 80% (currently 70%)
- Reliability average increased to 85% (currently 75%)
- Reliability: from ~7 to ~5 days of unscheduled downtime allowed (the 75%/85% thresholds correspond to ~7.5/~4.5 days in a 30-day month)
- Need to fail three months in a row to risk suspension
- Based on current figures, 15-20 sites will be below target each month
- Not necessarily the same every month
- Approval expected at next OMB, end of January
Campaign to start soon for verifying security contacts declared in GOCDB
- Bi-annual tests
- Site contacts will have to answer the test email; no answer will be considered a critical failure, followed up by the NGI
- If it works well, could be extended to other contacts
Towards a New HEP CPU Benchmark - M. Alef
HS06 has been designed for pledges, procurements and accounting
- Based on industry standard: SPEC
- A set of benchmarks: subset of SPEC CPU2006
- All C++ benchmarks: 3 integer, 4 floating point
- Proved to match HEP apps performance
HW improvements make it necessary to redesign the benchmark
- New SPEC14 expected at the end of the year
- Looks like a good candidate for HS14
HS14 requirements
- Proven match with HEP application performance
- Ease of use: free for academic use, easy to use (known) for vendors
- Timescale
- Preliminary kits available to SPEC OSG (Open Systems Group) members: KIT is an OSG associate and will have early access
- Final benchmark: end of 2014
- HEP WG in 2014/2015
Proposed steps
- Identify VO representatives ready to participate in the effort: should we include non-WLCG VOs like Belle2?
- Agree on benchmark environment
- HW platform: 64-bit (was already the case in HS06)
- Compiler: HS06 used the default gcc compiler
- Compiler flags: at least switch to 64-bit apps, possibly a higher optimization level (currently -O2); see the sketch after this list
- Selection of representative HEP apps to compare with benchmark scores
- Must be CPU bound: benchmark is not measuring I/O
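For reference, a sketch of the relevant SPEC config line (HS06 flags as commonly cited in the HS06 run rules; the second line is only the direction discussed, not a decision):

    # HS06 (SPEC CPU2006 all_cpp subset):
    OPTIMIZE = -O2 -fPIC -pthread -m32
    # possible HS14 direction: 64-bit binaries, perhaps a higher optimization level:
    OPTIMIZE = -O2 -fPIC -pthread -m64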
Andrew: HS benchmark should be based on open source SW, like GEANT4
- Basically, almost everybody else finds it more important to use an industry standard
I. Bird: extension to non-WLCG VOs is more than welcome but should probably remain HEP or very close sciences
- Belle2
- Astroparticles like Auger
Agreed next steps
- Experiment representatives already identified
- Identification of representative applications: can start now, without access to the new SPEC14 suite
- Running of representative apps on existing infrastructures
- Agreement on new set of compilation flags to use: must match representative apps requirements
EGI Plans for Future Activities - P. Solagna
Cloud developments: Fed Cloud will roll into production in May 2014
Business development activities
- Exploring pay-per-use model
- Ability to provision resources from commercial providers
- Linked with HELIX NEBULA activities
Data: data curation and preservation being evaluated
- HEP collaboration would be appreciated
Distributed Competency Centre
- Help with outreach and engaging with new communities
- Will require participation from experts from NGIs and VOs
AAI based on federated IdPs
- Another area for collaboration with WLCG
- A pilot should be started soon with one or more NRENs
H2020 plans: not finalized yet, most topics potentially relevant. 2 main topics on the list:
- Cloud-related project led by Fed Cloud TF
- Federated AAI
HTTP data access and DAVIX toolkit - A. Devresse
Motivation: HEP data access has different requirements from classical HTTP data access
- Scalability through redirectors
- Performance: session reuse, low latency
- Bulk, partial and vector operations
- File and data management through DAV
- 3rd-party copy, checksums, ACLs, staging, quotas
- Security: support for PKI/X509 auth
- Reliability: multiple replicas with automatic failovers
Many existing clients, none covering all the required features
- curl is the most complete
DAVIX goal is to provide a framework delivering/encapsulating all these features
- Needs to be easy to use
- Independent from grid MW
- Supported on all platforms
- Based on an existing http I/O library: libneon
- An API: a C++ lib with a high-level I/O API
- A set of command line tools for basic operations
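For illustration, what the command line tools look like (host and paths invented):

    davix-ls  https://se01.example.org/dpm/example.org/home/myvo/
    davix-get https://se01.example.org/dpm/example.org/home/myvo/data.root /tmp/data.root
    davix-put /tmp/results.txt https://se01.example.org/dpm/example.org/home/myvo/results.txt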
DAVIX Status: current version is 0.2.8
- Released in EPEL and Debian
- API/ABI stable
- Already used by Dynamic Federations, GFAL2, FTS3
- 0.3.0 in progress: SOCKS5 support, metalink transparent failover
- 0.4.0 plans: async I/O, multi-stream metalink, zero-copy architecture
Integration with ROOT underway: TDavixFile
- Will streamline http support in ROOT
- Currently available as a patch for ROOT 5.34 and ROOT 6
- EPEL6 package root-net-davix
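A minimal usage sketch via PyROOT, assuming the patch/package above is installed (URL invented): TFile::Open on an http(s) URL is then served by TDavixFile.

    # Hedged sketch: open a remote file over HTTP(S) through TDavixFile.
    import ROOT

    f = ROOT.TFile.Open("https://se01.example.org/dpm/example.org/home/myvo/data.root")
    if f and not f.IsZombie():
        f.ls()  # inspect contents as with any local TFile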
Cloud Pre-GDB Summary - M. Jouvin
See slides.
Discussion about conclusions in the slides:
- Are we staying with CPU time or going to wallclock time?
- No final decision, but currently staying with CPU time as the base for pledges and reporting; in fact we are collecting both...
- CPU time + reasonable overcommitment seen as the most efficient approach: overcommitment allows compensating for potentially inefficient use of VMs by VOs
- Need to get more experience
- Batch queues needed for VOs that don't run their own queue of tasks
- No disagreement; nobody is saying that everybody will get rid of batch systems
- The WG is here to explore issues and find possible solutions for running a cloud without a batch system in front of it. Several sites have already demonstrated that a cloud with a batch system in front of it (typically HTCondor) is transparent for VOs.
Next pre-GDB on this probably in Spring
--
MichelJouvin - 15 Jan 2014