Summary of GDB meeting, January 15, 2014 (CERN)
Agenda
https://indico.cern.ch/conferenceDisplay.py?confId=272795
Welcome - M. Jouvin
See slides for future (pre-)GDB planning
- CNAF Bologna will host the March GDB and probably a pre-GDB on batch systems
Started to think about the next WLCG workshop
- Early July for next WLCG workshop?
- Volunteers to host it? Please send an email to Michel and Simone
February pre-GDB will be an Ops Coord F2F meeting
VO-based SAM tests are difficult to prioritise. Timeouts are not really errors and should not be counted as failures of the site.
- SAM tests are not planned to use pilot jobs.
- Proposal to run critical tests with VO's lcgadmin role so sites can prioritise them.
- Other tests run with normal role but not counted as critical.
Discussion about the problem of prioritizing SAM tests:
- SAM tests: current proposal not seen as a real solution to scheduling issue
- At least 2 major tests require a normal/production role: glexec and WN (write access to data)
- Would be easier if SAM tests could run in the pilot framework: to be rediscussed with experts in more detail
- I. Bird: we don't care how tests are submitted; no need to use the SAM machinery, as long as experiments publish results into the SAM framework.
- Need to schedule a dedicated discussion slot at a future GDB with appropriate experts (February?)
- VOs should send their feedback on the proposal and the discussion today
Accounting Update
CPU Accounting - S. Pullinger
EMI-3 client new features
- HS06 support
- Richer data collected, in particular the number of nodes/cores (see the record sketch below)
- Portal will support the new EMI-3 features later this year (e.g. summaries by submit host)
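For illustration, an abridged sketch of an EMI-3 APEL individual job record showing the new fields (header and field names as commonly used by APEL, but treat them as approximate; values invented, comments added for the reader):

    APEL-individual-job-message: v0.3
    Site: EXAMPLE-SITE
    SubmitHost: ce01.example.org/cream-pbs-long
    LocalJobId: 12345
    WallDuration: 3600            # seconds
    CpuDuration: 3500             # seconds
    Processors: 8                 # cores used: new in EMI-3
    NodeCount: 1                  # nodes used: new in EMI-3
    ServiceLevelType: HEPSPEC     # HS06 support: new in EMI-3
    ServiceLevel: 11.5            # HS06 per core (illustrative)
    %%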
EMI-2 client: 30 April 2014 is end of security updates
Upgrade instructions
- Complete rewrite of the client: upgrade from EMI-2 requires more than just package upgrades
- Upgrade (switch) must be done on a month boundary: no ability to consolidate data collected/published by both versions for a site
Several other clients use the APEL SSM2 library: SGAS (NorduGrid), ARC/JURA, QCG, EDGI Desktop Grid
Enforced the data retention policy: user DNs only kept for 18 months
After the end of EGI, only bug fixes are expected for CPU accounting
- Development efforts only for storage and cloud accounting
Storage Accounting - J. Gordon
StAR: publishers in EMI-3 release of DPM (1.8.7) and dCache (2.5.2 and 2.6+)
- Not enabled by default
- Italy cuts StAR records from BDII information to collect data from unsupported storage systems (Xrootd, CASTOR, StoRM): a little less accurate but much better than nothing
- Could be extended to other sites that need it
- All using SSM to publish
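For illustration, a heavily abridged StAR record sketch (element names follow the EMI StAR draft but should be treated as approximate; values invented):

    <sr:StorageUsageRecord xmlns:sr="http://eu-emi.eu/namespaces/2011/02/storagerecord">
      <sr:RecordIdentity sr:createTime="2014-01-15T12:00:00Z" sr:recordId="example-0042"/>
      <sr:StorageSystem>se01.example.org</sr:StorageSystem>
      <sr:Site>EXAMPLE-SITE</sr:Site>
      <sr:SubjectIdentity>
        <sr:Group>atlas</sr:Group>
      </sr:SubjectIdentity>
      <sr:StartTime>2014-01-14T12:00:00Z</sr:StartTime>
      <sr:EndTime>2014-01-15T12:00:00Z</sr:EndTime>
      <sr:ResourceCapacityUsed>13617068017</sr:ResourceCapacityUsed>
    </sr:StorageUsageRecord>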
Since November, a few sites have been approached to activate publishing of StAR records: a few bugs ironed out
- Some flaws identified in the logic, but not a showstopper: working out with the developers what can be done
Portal has implemented a similar view to CPU with various selection options
- Several metrics: allocated resources, used resources...
- Several views
Future: target the T1s to produce a storage report that could be shown to the C-RRB
Simone: why not use the Italian approach for every storage system?
- DPM and dCache have numbers of good quality in BDII
Philippe: the portal is missing an easy way to produce, for a given VO, a view of site contributions as a function of time
- John: should be possible through custom view, discuss offline
Clouds - J. Gordon
Done in the framework of EGI Federated Cloud TF.
- Cuts a cloud Usage Record from the VMM database
- SSM used to send to APEL
- Already supports several cloud MW
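For illustration, a sketch of the cloud usage record cut from the VMM database and sent over SSM (header version and field names approximate; values invented, comments added for the reader):

    APEL-cloud-message: v0.2
    VMUUID: vm-0001-example
    SiteName: EXAMPLE-SITE
    GlobalUserName: /DC=org/DC=example/CN=Some User
    Status: completed
    StartTime: 1389776400         # epoch seconds
    EndTime: 1389783600
    WallDuration: 7200
    CpuDuration: 6800
    CpuCount: 2
    Memory: 4096                  # MB
    Disk: 20                      # GB
    CloudType: OpenNebula
    %%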
Prototype portal available
- Multiple views possible
- Aiming for production next Spring
Normalization/Benchmarking: required for comparison between sites...
- Machine/job features should help to make VM CPU power visible to accounting (known by the VMM, visible from VM)
- Machine features may include a benchmark component: no need for very precise measurement, any benchmark is better than nothing...
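A minimal sketch of how the benchmark figure could reach accounting, assuming the HEPiX machine/job features layout ($MACHINEFEATURES pointing at a directory of one-value-per-file keys; file names approximate):

    # Hedged sketch: read the HS06 rating published by the site/VMM.
    import os

    mf = os.environ["MACHINEFEATURES"]  # set by the site on the WN or inside the VM

    def feature(name):
        with open(os.path.join(mf, name)) as f:
            return f.read().strip()

    total_hs06 = float(feature("hs06"))  # HS06 rating of the whole host/VM
    slots = int(feature("jobslots"))     # number of job slots provided
    print("approximate HS06 per slot:", total_hs06 / slots)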
Merging grid and cloud data: should not be a major problem
- Similar to merging grid and local jobs
Will look at the possibility of incorporating data from commercial providers to give a global view of a VO
- Idea is to develop a bill parser
Alternative to infrastructure accounting (e.g. accounting data from experiment frameworks about user payload rather than pilot jobs): not much progress
GOCDB - J. Gordon
v5.2: a new feature allowing arbitrary (custom) key/value pairs to be defined for sites and used to select a set of sites (see the query sketch below)
Worth a presentation at a future GDB
- First a presentation at the WLCG IS TF
- Include OSG in the discussion
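For illustration, a hypothetical PI query selecting sites by a custom property (key and value invented; the exact parameter syntax is in the GOCDB PI documentation):

    https://goc.egi.eu/gocdbpi/public/?method=get_site&extensions=(WLCG_TIER=Tier-1)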
Volunteer Computing at CERN
Volunteer computing can be used to get access to opportunistic resources (e.g. idle desktops) but also to dedicated clusters without the need to install a complex grid stack
- Very different availability profiles for the different resources, depending on who provides them
- Can bring in resources from outside HEP
- Several Volunteer Computing MW: BOINC, XtremWeb, Condor: in every case, very easy for the contributor to install the agent required to participate
LHC@home: based on BOINC, 2 main projects
- SixTrack (accelerator physics): since 2004, 118k volunteers in total, 20k recently
- Test4Theory: CernVM+VirtualBox, 20k volunteers
- Requires installation of VirtualBox in addition to the BOINC client
Recent work: infrastructure consolidation (server failover capabilities), preparing for new apps using VirtualBox
Importance of outreach, user credits and application marketing to get contribution from outside HEP communities
- Less important for desktop grids using community member desktops
Some sites are starting to implement a policy where an idle desktop not participating in BOINC is automatically shut down
What about Protected Application Environment and BOINC/VirtualBox?
- Means it cannot run as a service and always needs the current user's cooperation
- Latest version of BOINC and Virtualbox fix this
LHCb Volunteer Computing Experience - F. Stagni
LHCb started to run MC jobs, using the VirtualBox approach
- Reusing the cloud work
- Mainly using community desktops and farms: no attempt yet to recruit volunteers outside the community
- Reused the Test4Theory approach
- No change required to DIRAC
A few issues encountered
- CVMFS: LHCb relies on it; the local cache has to be populated, sometimes for only one job
- Job duration: volunteers can switch off their desktops, so short jobs are needed for better efficiency
- Accounting: LHCb accounting is based on sites; how to account for these resources?
- BOINC Linux client not completely stable
Conclusion
- An easy way to run MC jobs everywhere
- Not completely trivial: need somebody to look at it
Discussion
Simone: why not add a BOINC client on a WN to backfill it when a multi-core job is not using all the cores?
- WN idle cycles will be used through the volunteer computing infrastructure
- Would be interesting to play with this idea and discuss it with the multicore deployment TF in Ops Coord
Worth more discussions in the future: a dedicated meeting (pre-GDB)
Actions in Progress
Ops Coord Report - J. Flix
Next meetings
- No meeting tomorrow: virtual meeting, just add information to the summary
- January 30th not yet confirmed
CVMFS
- Deadline for 2.1.15 set to March 1st
- Requirement for upgrading Stratum-0
- Experiments will add CVMFS tests as part of their critical profile soon
SL6: still a few sites (<5%) not yet upgraded
- Tickets open, please take care of them
glexec
- Still 27 sites not ready
- CMS and LHCb ready to use it
- ALICE and ATLAS require some developments
- Test will be made critical this month
perfSONAR
- Deadline for 3.3.1 is April 1st
- Dashboard no longer developed/maintained: need to find a new volunteer for this critical piece of the perfSONAR infrastructure
SHA-2
- EOS SRM issue for LHCb instance: fix should be deployed soon
- Several CAs have started to issue SHA-2 certs by default
- VOMS-Admin instability fixed: moved back to Java 6
- Java 7 issue: waiting for a fix, not expected soon
- A new VOMS-Admin cluster should be available soon
IPv6 validation
- Successful validation of gLite WMS submission to a CREAM CE
- Problem identified in HTCondor preventing WMS submission to Condor from working; to be fixed soon
WMS decommissioning progressing
- Followed up directly with VOs still using it
Tracking tools: slow progress in the migration from Savannah to JIRA
- Not yet a hard deadline at CERN but will happen!
Machine/job features available as an RPM for bare metal and OpenStack
- Deployed at CERN on WNs
- Next steps: make it available for OpenStack cloud environment, deliver it as a MW package in AA
- Waiting for feedback from experiments
MW readiness
- 1st meeting last December (12th): not enough participation from sites and experiments to really start the work
- Next meeting planned Feb. 6: need more volunteers
Multicore deployment
- Twiki area created
- Led by A. Forti and A. Perez-Calero
- Focused on grid but Cloud WG interaction expected/desirable
- Several activities occurring at ATLAS and CMS
- Doodle for a first meeting is open: please fill it in if you are interested!
OpenSSL and Java 7 Issues - M. Litmaath
First issue seen by dCache/DESY on Nov. 18, due to a new java.security default requiring RSA key sizes larger than 512 bits
- WMS still delegates 512-bit proxies: bug accepted
- Workaround is to modify java.security (see the sketch below)
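A sketch of the workaround (the property name is the real java.security one; the relaxation shown is just one possible variant):

    # $JAVA_HOME/jre/lib/security/java.security
    # JDK 7u40+ default, which rejects the 512-bit WMS proxies:
    jdk.certpath.disabledAlgorithms=MD2, RSA keySize < 1024
    # relaxed variant letting 512-bit proxies through:
    jdk.certpath.disabledAlgorithms=MD2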
Second issue seen Dec. 9 with openssl 1.0.1 (last working version 1.0.0-27)
- Took a long time to understand the real issue... (Jan. 9)
- Problem happens when both endpoints have been upgraded (TLS 1.2): they refuse to use 512-bit proxies
In fact all these 512-bit proxies come from GridSite, which is used by many other components
- New gridsite version fixing the problem available in EMI-2 and EMI-3 since mid-December
- Node types requiring the fix:
- WMS (SL5/6)
- FTS3: CERN pilot instance updated last Friday
- CREAM: at least EMI-2 on SL6, with the delegation workaround implemented by some sites several months ago
- Condor: situation unclear currently (not easy to fully assess)
The problem also affected Globus Toolkit 5.2
- OSG did an emergency fix
- EPEL RPM fixing the problem available as early as Dec. 20
Need to stress to sites the importance of upgrading the affected services so that the problem does not persist
- More and more sites upgrading to recent versions of openssl
Would be great to start a validation that MW components are ready to use 2048-bit proxies
- Risk of breaking some components: this is why we stay with 1024-bit proxies for the time being
- Not easy to do: may be quite time consuming
- Take advantage of IPv6 testbed?
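A minimal sketch of such a validation, assuming standard VOMS client options (VO name illustrative):

    voms-proxy-init -voms atlas -bits 2048
    # then exercise the usual workflows (job submission, FTS transfers, SRM access)
    # against each component under validation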
EGI Operation News - P. Solagna
SHA-2: very few sites left with non-compliant service
- EGI following up with these sites: thanks to COD and NGIs
- dCache SRM client available, hope to release it soon in UMD: difficult to check
Availability/Reliability thresholds: discussing the possibility of raising the availability and reliability thresholds by 10%
- Availability average increased to 80% (currently 70%)
- Reliability average increased to 85% (currently 75%)
- Reliability: from ~7 to ~5 days of unscheduled downtime allowed (the 75%/85% thresholds correspond to ~7.5/~4.5 days in a 30-day month)
- Need to fail three months in a row to risk suspension
- Based on current figures, 15-20 sites will be below target each month
- Not necessarily the same every month
- Approval expected at next OMB, end of January
Campaign to start soon for verifying security contacts declared in GOCDB
- Bi-annual tests
- Site contacts will have to answer the test email; no answer will be considered a critical failure, followed up by the NGI
- If it works well, could be extended to other contacts
Towards a New HEP CPU Benchmark - M. Alef
HS06 has been designed for pledges, procurements and accounting
- Based on industry standard: SPEC
- A set of benchmarks: subset of SPEC CPU2006
- All C++ benchmarks: 3 integer, 4 floating point
- Proved to match HEP apps performance
HW improvements make it necessary to redesign the benchmark
- New SPEC14 expected at the end of the year
- Looks like a good candidate for HS14
HS14 requirements
- Proven match with HEP application performance
- Ease of use: free for academic use, easy to use (known) for vendors
- Timescale
- Preliminary kits available to SPEC OSG (Open Systems Group) members: KIT is an OSG associate and will have early access
- Final benchmark: end of 2014
- HEP WG in 2014/2015
Proposed steps
- Identify VO representatives ready to participate in the effort: should we include non-WLCG VOs like Belle2?
- Agree on benchmark environment
- HW platform: 64-bit (was already the case in HS06)
- Compiler: HS06 used the default gcc compiler
- Compiler flags: at least switch to 64-bit apps, possibly a higher optimization level (currently -O2); see the sketch after this list
- Selection of representative HEP apps to compare with benchmark scores
- Must be CPU bound: benchmark is not measuring I/O
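For reference, a sketch of the relevant SPEC config line (HS06 flags as commonly cited in the HS06 run rules; the second line is only the direction discussed, not a decision):

    # HS06 (SPEC CPU2006 all_cpp subset):
    OPTIMIZE = -O2 -fPIC -pthread -m32
    # possible HS14 direction: 64-bit binaries, perhaps a higher optimization level:
    OPTIMIZE = -O2 -fPIC -pthread -m64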
Andrew: HS benchmark should be based on open source SW, like GEANT4
- Basically, almost everybody else finds it more important to use an industry standard
I. Bird: extension to non-WLCG VOs is more than welcome but should probably remain HEP or very close sciences
- Belle2
- Astroparticles like Auger
Agreed next steps
- Experiment representatives already identified
- Identification of representative applications: can start now, without access to the new SPEC14 suite
- Running of representative apps on existing infrastructures
- Agreement on new set of compilation flags to use: must match representative apps requirements
EGI Plans for Future Activities - P. Solagna
Cloud developments: Fed Cloud will roll into production in May 2014
Business development activities
- Exploring pay-per-use model
- Ability to provision resources from commercial providers
- Linked with HELIX NEBULA activities
Data: data curation and preservation being evaluated
- HEP collaboration would be appreciated
Distributed Competency Centre
- Help with outreach and engaging with new communities
- Will require participation from experts from NGIs and VOs
AAI based on federated IdPs
- Another area for collaboration with WLCG
- A pilot should be started soon with one or more NRENs
H2020 plans: not finalized yet, most topics potentially relevant. 2 main topics on the list:
- Cloud-related project led by Fed Cloud TF
- Federated AAI
HTTP data access and DAVIX toolkit - A. Devresse
Motivation: HEP data access has different requirements from classical HTTP data access
- Scalability through redirectors
- Performance: session reuse, low latency
- Bulk, partial and vector operations
- File and data management through DAV
- 3rd-party copy, checksums, ACLs, staging, quotas
- Security: support for PKI/X509 auth
- Reliability: multiple replicas with automatic failovers
Many existing clients, none covering all the required features
- curl is the most complete
DAVIX goal is to provide a framework delivering/encapsulating all these features
- Needs to be easy to use
- Independent from grid MW
- Supported on all platforms
- Based on an existing http I/O library: libneon
- An API: a C++ lib with a high-level I/O API
- A set of command line tools for basic operations
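For illustration, what the command line tools look like (host and paths invented):

    davix-ls  https://se01.example.org/dpm/example.org/home/myvo/
    davix-get https://se01.example.org/dpm/example.org/home/myvo/data.root /tmp/data.root
    davix-put /tmp/results.txt https://se01.example.org/dpm/example.org/home/myvo/results.txt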
DAVIX Status: current version is 0.2.8
- Released in EPEL and Debian
- API/ABI stable
- Already used by Dynamic Federations, GFAL2, FTS3
- 0.3.0 in progress: SOCKS5 support, metalink transparent failover
- 0.4.0 plans: async I/O, multi-stream metalink, zero-copy architecture
Integration with ROOT underway: TDavixFile
- Will streamline http support in ROOT
- Currently available as a patch for ROOT 5.34 and ROOT 6
- EPEL6 package root-net-davix
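A minimal usage sketch via PyROOT, assuming the patch/package above is installed (URL invented): TFile::Open on an http(s) URL is then served by TDavixFile.

    # Hedged sketch: open a remote file over HTTP(S) through TDavixFile.
    import ROOT

    f = ROOT.TFile.Open("https://se01.example.org/dpm/example.org/home/myvo/data.root")
    if f and not f.IsZombie():
        f.ls()  # inspect contents as with any local TFile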
Cloud Pre-GDB Summary - M. Jouvin
See slides.
Discussion about conclusions in the slides:
- Are we staying with CPU time or going to wallclock time?
- No final decision, but currently staying with CPU time as the base for pledges and reporting; in fact we are collecting both...
- CPU time + reasonable overcommitment seen as the most efficient approach: overcommitment allows compensating for potentially inefficient use of VMs by VOs
- Need to get more experience
- Batch queues needed for VOs that don't run their own queue of tasks
- No disagreement; nobody is saying that everybody will get rid of batch systems
- The WG is here to explore issues and find possible solutions for running a cloud without a batch system in front of it. Several sites have already demonstrated that a cloud with a batch system in front of it (typically HTCondor) is transparent for VOs.
Next pre-GDB on this probably in Spring
--
MichelJouvin - 15 Jan 2014