Details and notes of the ATLAS Sites Jamboree, March 2018
This ATLAS Sites Jamboree is part of the ATLAS SW&C technical week
On Thursday there is also the ADC&SW Joint session
>>> Compute & Support Session Minutes
> DPA Overview - Ivan
> Site Support - Xin
>> Baseline requirements
- Storage - SRM, diskless
>> Other services
- Computing - 25/75 - Analy/Production or UCORE
- Migration to CentOS
>> Trends
- Saving manpower for sites
- Pilot improvements
- Go UCORE
- Go Diskless
- Go Containers
- Go Standard software
- Still, flexibility should be kept - One Box Challenge, LSM, hybrid staging model
>> AGIS
- Get rid of obsolete data
- Keep the documentation up-to-date
>> Site support
- Go to ADC Meetings
- Follow DPA mailing list
- Respond to GGUS tickets
- Contact your cloud support list atlas-adc-cloud-XX
- Site Documentation - FAQ TWiki started
> Site Monitoring - Ryu
>> Availability & Reliability definition
>> Probes
- SE and CE Services
>> Monitoring
- WLCG dashboard
- MONIT in preparation
- Monthly reports are available
- Re-computation can be requested
>> Recent Changes
- etf_default in AGIS to select which queue to be used
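As a toy illustration of the described selection mechanism (not actual AGIS/ETF code; the field names are assumed):

```python
# Hypothetical sketch: among a site's queues, pick the one flagged
# etf_default for ETF testing, falling back to the first queue.
# The 'etf_default' field name follows the minutes; the dict layout is assumed.

def pick_etf_queue(queues):
    """queues: list of dicts with 'name' and optional 'etf_default' keys."""
    for q in queues:
        if q.get("etf_default"):
            return q["name"]
    return queues[0]["name"] if queues else None

queues = [{"name": "SITE_SCORE"}, {"name": "SITE_UCORE", "etf_default": True}]
print(pick_etf_queue(queues))  # SITE_UCORE
```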
>> Future Improvements
>> HS06
- Data Selection
- CPU Ranking
- HS06 vs events/sec per site - there are sites which are off from the values predicted by HS06 in AGIS
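A minimal sketch of the kind of consistency check described, flagging sites whose measured event rate is off from what their HS06 value would predict; the site names and numbers are made up for illustration:

```python
# Illustrative check, not ATLAS monitoring code: compare events/sec per
# HS06 unit across sites and flag outliers relative to the fleet median.

def flag_outliers(sites, tolerance=0.25):
    """sites: {name: (events_per_sec, hs06)}. Return sites whose
    events/sec per HS06 deviates from the median by more than `tolerance`."""
    ratios = {name: ev / hs06 for name, (ev, hs06) in sites.items()}
    median = sorted(ratios.values())[len(ratios) // 2]
    return {name: r for name, r in ratios.items()
            if abs(r - median) / median > tolerance}

sites = {
    "SITE_A": (120.0, 1000.0),  # (events/sec, HS06) - illustrative values
    "SITE_B": (60.0, 500.0),
    "SITE_C": (40.0, 1000.0),   # well below expectation
}
print(flag_outliers(sites))  # only SITE_C is flagged
```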
> Summary
> EventService for Sites - Wen
>> Events produced - peak: 33M/day
>> Events produced per site
>> Example of ES using resources - BNL_LOCAL, CONNECT*
>> Storage - much more stable
>> EventService
- Storage - fixed. Now focusing on es_merge_failed errors.
- HammerCloud blacklists files
>> ES Commissioning
>> ES brokerage/priority/share when scaling
>> ES sites when scaling
> Site Description - AleDS
>> Strategy
- Using GLUE 2 or info from jobs
- Plugins to different batch system to fetch info
- Initial prototype - ready
- Supported - LSF, PBS, HTCondor
- Info already available via kibana. Complete view of the nodes - available.
>> Checking of data
- LSF is matching with local monitoring data
- PBS - matching too
- Not possible to derive the total number of slots
- Sites do not allow usage of qstat
- HTCondor - matching too. One can reconstruct the total number of possible slots.
>> What we get:
- number of slots, nodes shared among different PQs..
- Some wrong configurations in AGIS. Ex: PBS set, but UGE actually used
>> Conclusions
- Next steps
> Workflow Analysis - Thomas
>> Introduction
- Grid, finished jobs Nov 2017 - Jan 2018 from UC Kibana. Separation by resource type. TWiki available.
>> Sum of Walltime x Cores
>> Processingtype
>> Walltime x Cores
- User jobs are usually very short
>> Execution Time x cores / n_events
>> CPU Efficiency per Core
>> Max PSS per Core
>> I/O Intensity
>> Stagein Time & Inputfile Size
>> Stageout Time & Outputfile Size
>> Kibana Dashboards & Jupyter are available
> Unified Panda Queue - Rod
>> Definition
- one queue per physical resource
>> Motivation
- Gshare and priorities should decide what job starts
- Run order based on priority only
- Evgen currently is manually capped
- Incorporate Analysis
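The intended behaviour above - shares decide which job type starts, priority decides order within a share - can be sketched as follows; this is an illustration under assumed names, not PanDA's actual brokerage code:

```python
# Illustrative dispatch rule: pick the job from the share that is furthest
# below its target fraction; break ties by job priority.

def next_job(jobs, running, targets):
    """jobs: list of (share, priority); running: {share: n_running};
    targets: {share: target_fraction of total running}."""
    total = sum(running.values()) or 1
    def deficit(share):
        return targets[share] - running.get(share, 0) / total
    return max(jobs, key=lambda j: (deficit(j[0]), j[1]))

jobs = [("prod", 500), ("analysis", 900), ("analysis", 100)]
running = {"prod": 80, "analysis": 20}       # prod is over its 60% target
targets = {"prod": 0.6, "analysis": 0.4}
print(next_job(jobs, running, targets))      # ('analysis', 900)
```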
>> Analysis
- We should be running more according to the gshare. This will be solved by UCORE
>> Challenges
- Needs unpartitioned cluster
- Changes in monitoring
>> Brokerage & Monitoring
- Not exposed in bigpanda
- HC PFT
>> Deployment
- Initially only possible via aCT
- Now also Cream CE
- Possible to submit to ANALY UCORE, ANALY MCORE or HIMEM
- jobs on a UCORE site are following the gshare exactly
>> Need to limit jobs
- We need limits on the physical properties of the jobs
- Add a parameter per sum of running jobs?
>> Conclusions
> Payload Scheduler - Fernando & Tadashi
>> Central control of resources for UCORE
>> Dimensions to be controlled
- Global shares, WQ alignment, local shares, site capabilities, data distribution
>> Global Share Definition - reminder
>> Panda flow scheme
- Global shares are taken into account at the last level - job dispatching
- For each share n_queued = 2*n_running
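The throttle rule above (keep n_queued = 2 × n_running for each share) can be sketched in a few lines; the function name and interface are illustrative, not JEDI/Harvester code:

```python
# Sketch of the stated throttle: feed each share just enough jobs so
# that its queued count reaches factor * running, never going negative.

def jobs_to_queue(n_running, n_queued, factor=2):
    """How many more jobs to queue for a share to satisfy
    n_queued = factor * n_running."""
    return max(0, factor * n_running - n_queued)

print(jobs_to_queue(n_running=100, n_queued=150))  # 50 more needed
print(jobs_to_queue(n_running=10, n_queued=40))    # already over: 0
```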
>> Alignment with JEDI WQ
- There is a work queue per share per type (mcore, score..)
- There are ad-hoc WQ for big resources for example
>> Global shares scheme
>> Global shares in BigPanda
>> Changing shares UI in ProdSys.
- Adding a new share is still a developer action
>> Local partitioning
- No way to tell whether mcore / score partitioning is dynamic
>> Local partitioning ANALY vs PROD
>> Global Shares Summary
- Working well. We already have operation experience
- Things to change in order for the shares to work perfectly
>>> Harvester
>> Motivation
- Pilot submission to all resources
- Pilot stream control
- Clear separation of functionalities with pilot (Pilot 2)
- Resource discovery & monitoring
>> Design
- Everything fully configurable / pluggable
- Push & Pull modes
- HPC mode
- Runs on edge node - light and fulfilling edge node restrictions
- WN communication via the shared FS
- Grid/cloud mode
>> Architecture
>> Status - HPC
- In use in most of the US HPCs
>> Status - GRID
- HTCondor plug-in
- Scalability tests are ongoing. How many jobs can be run simultaneously? Can it run on all sites?
- Running on 200 PQs, a few jobs per PQ.
>> First Pilot Streaming commissioning
>> Status - Cloud
- Mixed mode: with cloud scheduler
- Harvester only - Harvester is handling the VMs itself
- Trying condor-let approach
>> Status - Monitoring
- First implementation - ready
>> Plans
> Blacklisting Overview - Jarka
>> How to receive PFT/AFT HC tests
- New: HC tests UCORE
- New: Testing can start within 1 hour
>> HC Blacklisting
- New: AGIS controller in production
>> How to:
- Set up PQ status to ONLINE/TEST/BROKEROFF/OFFLINE
- Set up PQ status to AUTO i.e. take over by HC
- Adding a new default queue
- Adding new PQ
- Decommission a PQ