US ATLAS Computing Integration and Operations
November 12, 2014
Attending: many
Apologies: Shawn, Jason
ADC TIM 2014 Chicago
- Minutes available
- Remind Simone about action items.
Condor-CE:
- New interest in global ATLAS
- How it appears in the AGIS; SAM tests.
- Need to define the gaps and other issues identified by Bob, e.g.
- Should discuss at the Jamboree
Tier3 hardware
- End of life hardware
- Idea to augment Tier3s with hardware retiring from Tier2s, to be used for end-user analysis
- Creation of a shared Tier3 pool - making these available for US physicists
- 4 year retirement versus 5 years
- Discussion of Tier2 sites opening resources for ATLAS Connect; all Tier2s have indicated a willingness to set this up.
- Accounting discussions should take place in the RAC
- Will need to look at priorities, and enforcement.
Welcome new personnel from NETier2. John Brunelle left to become a Google Engineer in LA. James Cuff has assigned two new people to the group - they'll be joining next week. Congratulations! Thanks to Saul for making it a smooth transition.
Mayuko - LOCALGROUPDISK monitoring
- Slides
- The system manages space based on a facility wide quota per user
- It might be possible to add a mechanism that lets users easily delete their own datasets.
- Who receives the messages about cleanup decisions?
- Who is responsible for removing datasets? Kaushik: hope that Rucio provides this capability, so users can do this themselves. (At present, Armen has to do it.)
- Not yet ready for users. Kaushik would like RAC approval.
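A minimal sketch of what a facility-wide per-user quota check could look like, assuming usage is summed per user across all sites before comparing against one quota. The record fields (owner, site, bytes), the 1 TB quota, the site names, and the sample data are hypothetical; they are not taken from the slides or from Rucio.

```python
from collections import defaultdict

TB = 10**12  # decimal terabyte, in bytes


def usage_by_user(datasets):
    """Sum dataset sizes per owner across all sites (facility-wide)."""
    totals = defaultdict(int)
    for ds in datasets:
        totals[ds["owner"]] += ds["bytes"]
    return dict(totals)


def users_over_quota(datasets, quota_bytes):
    """Return {owner: bytes above quota} for owners exceeding the quota."""
    return {
        owner: used - quota_bytes
        for owner, used in usage_by_user(datasets).items()
        if used > quota_bytes
    }


# Hypothetical sample: one user's data spread over two sites.
sample = [
    {"owner": "alice", "site": "SITE_A", "bytes": 700 * 10**9},
    {"owner": "alice", "site": "SITE_B", "bytes": 600 * 10**9},
    {"owner": "bob", "site": "SITE_A", "bytes": 200 * 10**9},
]
over = users_over_quota(sample, quota_bytes=1 * TB)
print(over)
```

Neither of alice's sites exceeds the quota on its own; she is flagged only because the sum across sites does, which is the point of a facility-wide quota.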
Dave Lesny - Stampede and StratumR
- Slides
- Michael: good idea for other sites. Mike Norman from SDSC has made an offer for ATLAS Connect.
- What edge services are required?
- Other HPC sites? Revisit post SC14
Production - Mark
- Summaries are posted
- Things are quiet.
- Production levels are low; expect fluctuations.
- Now is a good time for downtimes
- 8k jobs at BNL_CLOUD, due to a special request: running MCORE jobs at scale in AWS and studying scaling issues at various ends. (Most of this is not mandatory production.)
DDM issues
- Armen: generally things are looking okay
- Rucio migration planned during Thanksgiving week
- Next week expect irregularities
- Saul: all sites should get on the current version of dq2-home, rucio-home; the versions are slightly different. Make sure to use the "latest".
- Dave: the pilot wrapper once called the DDM setup; Jose would like to change the setup. Horst, Saul: have the setup in different place(s). Leave things as they are for now and make the change in one step; do only the urgent change, which is to make sure you run: source /cvmfs/atlas.cern.ch/repo/sw/ddm/latest/setup.sh. Dave: this will always give you the latest.
- Jose is looking for guidance. Saul - will take this up, BU as a test site. Dave will help as well. Horst: should remember to get rid of atlas-wn.
- Michael: we need some consolidation across the sites; make it coherent; then let Jose know how we want to proceed. Consulting with Dave & Horst.
Site reports
- BNL
- AWS scaling test, using BNL SE (setting up AWS SE). FTS3 has added support for S3; will evaluate. ESnet is setting up a direct connection between ESnet and Amazon. At 20,000 cores, bottlenecks were identified in the Amazon western region, leading to job losses.
- AGLT2
- MWT2
- NET2
- Purchasing coming soon. Added an OSG queue.
- SWT2
- Patrick: quiet, nothing to report. Horst: nothing to report, all three sites running well.
- WT2
- Looking to replace Thumpers and Thors. No discounts coming from Dell. R730 with 24 2.5-inch drives. Higher than list price!
AOB
- Bob: User analysis jobs are using more CPU time than wall time. This seems to be related to jobs using an older version of RootCore, and they are killed. Alden will circulate to DAST.
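A minimal sketch of the kind of check that would flag such jobs, assuming a job is suspect when its CPU time exceeds wall time multiplied by the number of cores in its slot. The job record fields (id, cpu_s, wall_s, cores), the default of one core, and the sample jobs are hypothetical; the actual kill criterion used at the sites is not stated in these minutes.

```python
def cpu_exceeds_wall(cpu_seconds, wall_seconds, cores=1):
    """True if CPU time is larger than the wall time available on `cores` cores."""
    return cpu_seconds > wall_seconds * cores


def flag_jobs(jobs):
    """Return the ids of jobs whose CPU time exceeds their allotted wall time."""
    return [
        job["id"]
        for job in jobs
        if cpu_exceeds_wall(job["cpu_s"], job["wall_s"], job.get("cores", 1))
    ]


# Hypothetical sample jobs; times are in seconds.
sample_jobs = [
    {"id": "job-1", "cpu_s": 7200, "wall_s": 3600},              # single-core slot, 2x CPU
    {"id": "job-2", "cpu_s": 3500, "wall_s": 3600},              # within its slot
    {"id": "job-3", "cpu_s": 7200, "wall_s": 3600, "cores": 8},  # multi-core slot
]
flagged = flag_jobs(sample_jobs)
print(flagged)
```

Scaling by the core count keeps legitimate multi-core jobs from being flagged; only job-1, which uses more CPU than a single-core slot can supply, is reported.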