Welcome to Harvard - John Huth
====================
Meeting Logistics
US Tier 2 Issues - Jim Shank
==================
LHC Schedule
News for DOE/NSF Review August 15
Workshop Goals
Welcome to two new Tier 2 Sites
2007 is a very aggressive schedule for the LHC
First Running at injection energy Nov-Dec 2007
Full 14 TeV running in 2008
Discussion of using ganglia for joint monitoring
New Tier 1 ramp up now has a significant reduction in the number of CPUs
New Tier 2 ramp up is essentially unchanged
ATLAS computing timeline is unchanged for Tier 2s.
Resource Allocation Committee (RAC) is now formed with a full set of committee members
Now have the tools to make resource allocations to US ATLAS members
Projected T2 growth is not fast enough, leaving money unspent.
Need to review T2 ramp up numbers.
Tier 3 Centers
Committee produced a white paper. Much discussion with the funding agencies on funding models.
CSC11 Production - OSG/ATLAS produced 24% of the ATLAS CSC production
All Tier 2 sites should be equal and saturated
Funding targets: may ramp up Tier 1 more slowly to meet Tier 2 needs
Will try to get more shortfall from management reserve
Use it or lose it...
Goal of the meeting is to fill in goals on the Twiki page
Tier 2 documentation - would like to clean up and rationalize web pages
NE Tier 2 Status - Saul Youssef
===================
Boston U and Harvard U collaboration
75% Panda, 25% Local, trace OSG usage
Discussion of random number generation on 64 bit machines.
Worries about using PBS instead of Condor
Wait and see approach to dCache vs. GPFS
WAN is 1 Gbps with a very fast local fiber ring, but the connection to NYC is OC48.
Joining the new "NewNet" (new part of Abilene), which will provide 2 x 10 Gbps.
Outstanding issues:
1) CPU purchases - what to buy?
2) Hardware implications of new fast network.
3) Storage architecture
Considerable support for GPFS from Harvard people - much concern on cost of GPFS.
Greg Cross is worried about some problems with GPFS, although it is able to use many systems.
HyperNews spam incident at BNL - how to be open but not receive spam?
Midwest Tier 2 - Greg Cross
=================
Collaboration between UC and IU
Project update
New order just placed should arrive in about a month.
10 Gbps mostly in place at both sites.
GUMS is working.
Trying to decide between PBS and Condor - sites will be "mirrors" of each other.
Storage is on the individual nodes using dCache.
UC and IU sites will be "mirrored" - a single cluster configuration for ease of management.
Next three months we will deploy our hardware.
Southwest Tier 2 - Mark Sosebee
====================
Collaboration between University of Texas Arlington and Oklahoma.
160 worker nodes, dual EM64T @ 3.2 GHz, on new cluster (with IBRIX) at UTA
75 worker nodes on older cluster at UTA
40 worker nodes, dual EM64T (with IBRIX), at OU
135 nodes on older OSCER cluster (shared)
512 nodes on new OSCER cluster (shared)
Current UTA link is OC12 to Internet2 peering in Houston.
Future link will use NLR.
OU is on NLR.
New cluster is online at UTA
Large effort devoted to DQ2
Scaling is a problem with IBRIX when the number of running jobs exceeds 150.
UTA is working on the problems with the IBRIX support team.
Razvan asks about the long term support for IBRIX.
Lots of discussion on IBRIX/storage.
UTA has held two workshops (March and May) to promote physics analysis.
Great Lakes Tier 2 - Shawn McKee
=====================
Michigan and Michigan State
Uses Michigan LambdaRail.
MSU has new cluster (mainly for D0) with dual-dual Opterons.
Michigan has older cluster that they are still using. The cluster is installed with ROCKS.
Using Cacti for monitoring.
Using OCS Inventory Next Generation to make a database of hardware characteristics.
Makes extensive use of AFS to reduce exposure to NFS problems.
Also using dCache.
Still have some services that are not yet up (e.g. DQ2).
There is a prototype cluster of dual-dual Opterons (5 worker nodes + head node).
They want guidance on needs and time scales. It would be simpler if their equipment could be installed in March of 2007.
Both Michigan and MSU have high quality space that will be available next spring.
They are close to providing cycles. They should appear to be a single site to outside users.
Saul asks why they are using both dCache and AFS. Answer: they have two >100 GB disks on each node that they want to take advantage of.
Their long-term plan is to use Condor, but some systems are currently using PBS/Torque.
Western (SLAC) Tier 2 Status - Wei Yang
===============
Western T2 was approved in July.
SLAC has long experience with OSG.
They have successfully run ATLAS jobs via PandaGrid.
Management Team (not charged to ATLAS):
Richard Mount 10%
Chuck Boeheim 15%
Randy Melen 10%
Technical Team (1 ATLAS FTE)
Wei Yang 30-100%
Brooker Bense 20%
Lance Nakata 20%
Scientific and Computing Services 30%
Resources for OSG:
4 SUN V20z
OSG Gatekeeper / Gsiftp
GUMS
VOMS
500 GB Storage
Resources for ATLAS
MySQL replica for CondDB
DQ2 site server / Web proxy for Panda Pilots
500 GB NFS for DQ2
10 job slots in LSF for grid users.
Access to LSF for local ATLAS users.
AFS space for kit and environment
Prototype dCache
Leveraged Assets:
~3700 CPU cores in LSF pool (RHEL3 & 4)
~30 CPU cores for interactive nodes RHEL 3 and 4
Challenges:
Grid jobs overloading AFS
Batch nodes do not have internet access (security reasons).
Job transformations don't all use CondDB
Security issues with DQ2 and MySQL
Want to use Fair Share
dCache
Using RHEL instead of SL
Need to decide on what hardware to buy
Need to decide storage architecture
Plans
67% funding storage / 33% funding CPU
Large potential for leveraged CPU.
Working on web page
Torre asks about leveraged resources. Richard explains how it is possible.
Research Computing at Harvard (relevant to LHC) - John Huth
======================================
Samples of Harvard work:
- Crimson Grid
- Initiative in Innovative Computing
- EGG project
- Another kind of challenge: Inverse mapping problem.
Crimson Grid - Joy Sincar
Local grid using Condor
Collaboration with GLOW (UW) / Miron Livny
Initiative in Innovative Computing - Alyssa Goodman & Tim Clark
Fills the gap between science and computer science
EGG Project - Saul Youssef et al.
Pacman extension
"Market Place" driven resource allocation
LHC Inverse Mapping Problem - N. Arakani-xxxxx and G. Kane
How does real data map back into theoretical parameter space?
Try a systematic approach.
US ATLAS Facilities - Razvan Popescu
=======================
What is the baseline set of services that we need?
Functions provided by Tier 2 centers:
- Provide computing resources
- flexible means for authentication
- flexible resource allocation
- provide tech support
- provide monitoring and usage accounting
- No interactive work, no direct user support
CPU Resources via grid access
CPU resource utilization - allocation policy services (incl. service agreement policies).
BNL is using Condor (dropped LSF) - requires one full time FTE for support.
Storage - two types:
1) Flexible high performance (NFS, IBRIX...)
2) Distributed low cost (dCache)
Access control:
- Provide multi-CA authentication
- Provide audit trail
Long discussion on how to provide the audit trail in a grid environment.
Labs cannot use group accounts (security).
Monitoring and Accounting
Need central information collection
Tech Support
Need to understand the role of Tier 2s
Capacity Profile
Doing well on CPU and somewhat behind on storage.
Need to make adjustments to the model.
CPU and Storage access.
OSG Gatekeepers, OSG Gridftp, and SRM
CPU Allocation
Obviously use a queuing system but also need Service Agreements and/or Hard/Soft allocations.
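For illustration, a minimal sketch of a hard/soft allocation check sitting in front of the queuing system (the group names and limits are hypothetical, not an agreed policy):

    # Hypothetical hard/soft allocation check performed before handing a job
    # to the batch system; group names and numbers are illustrative only.
    ALLOCATIONS = {
        # group: (soft_limit, hard_limit) in running-job slots
        "usatlas-prod": (400, 500),
        "usatlas-user": (100, 150),
    }

    def admit_job(group: str, running: int, idle_slots: int) -> bool:
        soft, hard = ALLOCATIONS[group]
        if running < soft:
            return True                      # within the agreed (soft) share
        if running < hard and idle_slots > 0:
            return True                      # soft overrun allowed on idle capacity
        return False                         # at the hard limit: hold the job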
Storage and Data Management
Provide "global" FS and low cost
We need quota even in dCache
Authentication and Authorization Services
- GUMS
- VOMS
- dynamic mapping
Monitoring and Accounting
- Ganglia
- Nagios
- MonAlisa
- OSG Accounting
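As a sketch of the central information collection mentioned above, the following polls each site's Ganglia gmond daemon for its cluster XML (the host names are hypothetical; 8649 is the usual gmond default port):

    # Minimal sketch: pull cluster XML from each site's Ganglia gmond daemon
    # for central collection. Host names are hypothetical.
    import socket
    import xml.etree.ElementTree as ET

    SITES = ["gmond.bu.example.edu", "gmond.uc.example.edu"]  # hypothetical hosts

    def fetch_gmond_xml(host: str, port: int = 8649) -> ET.Element:
        chunks = []
        with socket.create_connection((host, port), timeout=10) as sock:
            while True:
                data = sock.recv(65536)
                if not data:
                    break
                chunks.append(data)
        return ET.fromstring(b"".join(chunks))

    for site in SITES:
        root = fetch_gmond_xml(site)
        for host in root.iter("HOST"):
            print(site, host.get("NAME"), host.get("REPORTED"))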
Support
Must define the responsibility chain & call out procedures
Need contacts
RT System into production (RHIC has converted almost completely)
Documentation - must still be written.
An FTS/ToA Model - Dan Shrager
====================
Worked on DQ2 install with help from Horst and Patrick.
DQ2 logical high level - unchanged; implements current DDM policies in ToA
This level defines the data path logic.
DQ2 low level - physical transport
Implemented in FTS and ToA at a low level - intended to eliminate road blocks.
Free access with certificate
FTS Model
Generates a list of serviced end points and NOT bi-directional channels.
One (and only one) rule: any transfer request, regardless of its direction, is serviced as long as one endpoint is affiliated.
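For illustration, a minimal sketch of this single servicing rule (the endpoint names in the affiliation set are hypothetical):

    # Illustrative sketch of the one rule: a request is serviced if at least
    # one endpoint is affiliated with this FTS, regardless of direction.
    AFFILIATED_ENDPOINTS = {"BNL", "NET2", "MWT2"}  # hypothetical names

    def is_serviced(source: str, destination: str) -> bool:
        # Requests with both endpoints foreign are denied.
        return source in AFFILIATED_ENDPOINTS or destination in AFFILIATED_ENDPOINTS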
ToA Format
The ToA uses a reverse format with respect to the FTS direct list.
Each site is tagged with the preferred FTS (eg BNLFTS or NOFTS).
The ftsTopology section is eliminated - as redundant or incomplete.
Features of this data model
- Full load control for each endpoint (as opposed to channel load control)
- Good partitioning at the Tier borders - requests with both endpoints foreign are denied
- Default transfer priority: internal transfers take precedence over ones with one foreign endpoint.
- Simplicity: N endpoints vs N^2 channels
- Use of reverse format
- Default criteria for dataset location can be based on FTS affiliation and/or capability.
US ATLAS Tier 2 configuration
Current FTS/ToA implementation supports a free DDM model (advantage: you can take data from the closest location).
Inter T2 transfers become possible and welcome for better data distribution.
LRC in LFC formatting can be added to a site's DQ2 installation.
European datasets become available on demand
See shell script attached to the talk.
A pair of catch-all channels is required in FTS and ToA configuration.
Conclusion about the possible new model
DQ2 implementers - simpler, comprehensive
DDM policy enforcement - site load control
US ATLAS Tier 2 sites - unrestricted access
Physicists - free access to data.
Kaushik says current design is based on early DQ2 version capabilities.
Dantong points out that FTS can have an advantage from knowing the closest servers.
Dantong points out that, since IU & UC are closely linked, each should be the other's first choice for data if the needed file is at the other site.
Dantong asks what happens if a European site wants data from a US Tier 2.
Torre notes that there is a very important policy issue here.
Richard Mount: either you keep a strict hierarchical Tier 0/1/2/3... model or you need something quite sophisticated.
Razvan asks whether these are all routing problems, how this is implemented, and how data can be routed at the lowest level. Dan is asked what happens if BNL is down.
Proposal from Alexei:
- Implement Dan's model for a month, and provide feedback at the DDM meeting at BNL.
- Provide allocation controls at the site level (to manage load), e.g., the number of simultaneous transfers.
ATLAS Software Infrastructure - Fred Luehring
============================
Overview of the SIT and responsibilities
US SIT people
- largely a European effort, but a significant number of US people are involved.
Hot issues of the SIT
- release schedule
- new linux/compiler versions
- new platforms to support
- new support people to cleanup CVS
- geometry/conditions database release distribution
- kit improvements
Release schedule
- controlled by SPMB/CMB; the build is done by the SIT
- Sep 15, simulation with 12.0.3
...
New linux
- CERN expects SLC4 by end of Oct
- SIT is running nightlies for SLC4
- SLC5 will be too late for LHC turn-on
- SLC4 includes gcc 3.4.5
32 vs 64 bit
- Nightlies are running that test native 64 bit versions
- Goal is to have 64 bit mode validated by end of 2006
- This is for AMD64/EM64T
- no plan to use MAC for production
New SIT people at CERN
- four people with various roles starting now at CERN
- all will contribute to user support
- some will work on documentation
Cleaning up CVS
- either new instance or new directory tree, to include
only active packages
- subversion not accepted now, perhaps in the future
Conditions database versioning
- it became apparent in the last few months that this is needed
- tag collector team looking at ways to tag database versions
Atlas data security
- worried about unauthorized use of ATLAS data and code
- there is a team investigating this now
- there may be implications for grid data storage
Kit Issues
- A kit for every nightly
- To run RTT tests on grid - reduce problems caused by AFS
- One pacman file per project, not package, saving much time
- Pacman mirrors
Installing ATLAS offline SW
- How? Pacman mirrors
- Currently using central CERN cache
- Suggest mirrors with hourly updates via crontab job
for every site
- Issue - how to trace which mirror was used
- Run kitvalidation
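One possible way to address the mirror-traceability issue raised above: a small hypothetical helper, run from the hourly cron job, that picks a reachable mirror and logs the choice (the mirror URLs and log path are invented for illustration):

    # Hypothetical helper: pick the first reachable Pacman mirror and record
    # the choice, so installs can later be traced to the mirror that served them.
    import datetime
    import urllib.request

    MIRRORS = [
        "http://mirror1.example.edu/atlas-kit-cache/",
        "http://mirror2.example.edu/atlas-kit-cache/",
    ]
    LOG = "/var/log/atlas_kit_mirror.log"

    def choose_mirror() -> str:
        for url in MIRRORS:
            try:
                urllib.request.urlopen(url, timeout=10)
                return url
            except OSError:
                continue
        raise RuntimeError("no Pacman mirror reachable")

    mirror = choose_mirror()
    with open(LOG, "a") as log:
        log.write(f"{datetime.datetime.now().isoformat()} using {mirror}\n")
    print(mirror)   # feed this cache URL to the actual pacman invocation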
More on mirrors
- reduced install time
- multiple releases in same dir tree
Conclusions
- subscribe to hypernews discussions for Releases
and Distribution Kit and Offline Announcements
- report all problems to savannah!!
Panda - Kaushik De
============
Panda Development
- Robust local pilots
- DQ2 enhancements
Local pilot submissions
Xin is working on this.
Robust pilot job submissions
Originally developed / tested at UTA by Patrick
Now in use at BNL
Marco is continuing parallel development of CondorG scheduler
Multi-tasking Pilots (Paul Nilsson)
Primary goal is to add the ability to inject short analysis jobs into CPU that is busy with a long job.
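A rough sketch of the general idea only (this is not the actual pilot code; the command names and the short-job source are invented): while the long job runs, spare capacity can be used for short analysis jobs.

    # Rough sketch of the multi-tasking idea; NOT the Panda pilot implementation.
    import subprocess
    import time

    def get_short_job():
        # Hypothetical: return a short analysis job command, or None if none waiting.
        return None

    long_job = subprocess.Popen(["./run_long_production_job.sh"])   # hypothetical command
    short_jobs = []

    while long_job.poll() is None:           # long job still running
        job_cmd = get_short_job()
        if job_cmd and len(short_jobs) < 1:  # inject at most one short job at a time
            short_jobs.append(subprocess.Popen(job_cmd))
        short_jobs = [p for p in short_jobs if p.poll() is None]
        time.sleep(60)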
Accounting and Quota Systems
New development by Sudmash Reddy and Torre.
Data Service for non-DQ2 Sites
New work by Karthik, Horst, and Marco
Work is to enable use of sites that do not have DQ2.
Panda Usage Model
Many types of users - Panda must deal with them all.
A number of cases:
- Managed ATLAS Production
- Regionally submitted jobs
- Jobs submitted by individual users
- Others???
Discussion on security of the grid. Richard Mount is worried about our high level of visibility making the HEP grids a target.
Screen shot of Panda monitoring.
US Production
Done with Panda
Panda User Monitoring
Display of which users submitted Panda jobs.
Panda Usage Accounting
Tracks who is using what.
Conclusions
Alexei Klimentov has been appointed DDM coordinator
OSG Status and 0.6.0 - Rob Gardner
======================
Much of talk taken from talks given for OSG/EGEE and WLCG meetings.
OSG Service Stack: NMI, VDT, OSG Release
OSG Service Overview
Compute elements: GRAM, GridFTP, GIP
Storage elements: SRM-drm, SRM-dCache
Site level services: GUMS
VO level services: VOMS
VO edge services
Multi-VO
OSG Timeline
See talk for diagram.
ITB 0.5.0 is under development
OSG 0.6.0 is under development
Current OSG Release Description
- VDT 1.3.10
- Privilege infrastructure
- GT4 GridFTP
- GT4 Pre Web Services.
Slide showing the contents of VDT 1.3.10
Privilege Authorization Services
Site level services to support role-based access to Tier2 resources
Receives updates on mappings from VOMS
Reverse map created for accounting
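A toy illustration of the forward and reverse maps described here (the DNs, roles, and account names are invented; this is not the GUMS implementation):

    # Toy illustration of role-based mapping and the reverse map used for accounting.
    FORWARD_MAP = {
        ("/DC=org/DC=doegrids/OU=People/CN=Some User", "production"): "usatlas1",
        ("/DC=org/DC=doegrids/OU=People/CN=Some User", "software"):   "usatlas2",
    }

    # Reverse map: local account back to the grid identities that used it.
    REVERSE_MAP = {}
    for (dn, role), account in FORWARD_MAP.items():
        REVERSE_MAP.setdefault(account, []).append((dn, role))

    def map_user(dn: str, role: str) -> str:
        # Map a grid credential (DN + VOMS role) to a local account.
        return FORWARD_MAP[(dn, role)]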
Authorization Process
GUMS server can be a single point of failure.
ATLAS is steadily consuming OSG resources.
Middleware Release Roadmap
OSG 0.6.0 Fall 2006
Accounting
Squid
SRM V2 + AuthZ
CEMon-ClassAd
Support for MDS-4
Possible requirement to use WS-GRAM
Edge Services Framework
gLexec
OSG 0.8.0 Spring 2007
Just in time scheduling, Pull-Mode Condor-C
Support for sites to run Pilot jobs and/or Glide-ins
OSG 1.0.0 End 2007
Edge Services Framework
Goal is to support deployment of VO services
Based on XEN virtual machines and service images
Site supports a XEN server and VO loads images
Accounting
System is based on probes and collectors
Has publishers for reports and querying
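A generic sketch of the probe-to-collector flow (this is not the actual OSG accounting interface; the record fields and collector URL are invented):

    # Generic probe -> collector sketch; not the real OSG accounting interface.
    import json
    import urllib.request

    def send_usage_record(collector_url: str, record: dict) -> None:
        req = urllib.request.Request(
            collector_url,
            data=json.dumps(record).encode(),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req, timeout=30)

    send_usage_record(
        "https://collector.example.org/records",   # hypothetical collector
        {"site": "AGLT2", "vo": "usatlas", "wall_hours": 12.5, "jobs": 3},
    )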
Storage Authorization
Developed at SDSC (high priority for CMS also).
Resource Selection
ATLAS requirements
- Opportunistic access to non-ATLAS sites without DQ2
o Uberftp on computing nodes
- Improving DQ2 installation
o glite-*-*
o CGSI_gSOAP
o edg-gridftp-client-1.2.5-1
Patrick worries that DQ2 is evolving too fast for these pieces of software to be useful.
Rob points out that it may help other VOs interested in LCG interoperability.
ATLAS wish list?
- Accounting - can the OSG service be used by the RAC?
o Test with USCMS T2
o Test in ITB
o Compare with Panda collected statistics
o Need to collect and report non-Panda usage of resources
- Managed storage within SRM/dCache SEs (right now "wild west" storage)
-...
More information on what's coming...
- See ITB release description
- ATLAS BO page in OSG
- Next week's OSG consortium meeting
(for details please see urls in slides)
Dan asks about private network pools. Rob answers that we can bring this up.
Ultra Light Update - Shawn McKee
=====================
Status Update
1) New kernel
2)
Kernel Development
Found that a tuned kernel was needed.
Light Path Plans
New light path technologies are emerging.
VINCI being developed by Iosif Legrand.
Shows a demo of dynamic light path building.
LISA, EVO and Endhosts
VRVS is being replaced by EVO (Enabling Virtual Organizations)
Merger of MonAlisa and VRVS
FTS and UltraLight
Work to improve communication between physics people and networking/data people.
Shows a number of slides from Harvey Newman
Please see slides for details - this talk went by very quickly.
TeraPaths Project Team - Dantong Yu
=======================
TeraPaths is complementary to UltraLight.
Enhanced network functionality with QoS features
QoS Tools
- IntServ
o RSVP
- DiffServ
- ...
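As a small illustration of DiffServ, an application can mark its own traffic by setting the DSCP bits on a socket, as below (the endpoint is hypothetical; in practice the marking and policing may instead be done by switches and routers):

    # Minimal DiffServ illustration: mark a socket's traffic with a DSCP
    # codepoint (here EF = 46) by setting the IP TOS byte.
    import socket

    DSCP_EF = 46                      # Expedited Forwarding codepoint
    tos = DSCP_EF << 2                # DSCP occupies the upper 6 bits of the TOS byte

    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, tos)
    sock.connect(("data.example.org", 2811))   # hypothetical endpoint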
TeraPaths Projects
New Development
Going from last-mile to end-to-end QoS.
Using web services
Site Bandwidth Partitioning Scheme
Acquired Experience
- Enabled and tested LAN QoS inside BNL campus network
- Tested and verified MPLS between BNL and Michigan
In progress & future work
- Add GUMS
- Develop site-level network resource manager
- Support dynamic bandwidth/routing adjustments
User Support - Dantong Yu
================
Shows current web support scheme.
They are phasing out the home-grown CTS ticket system in favor of RT. Evaluating the RT system - so far so good.
There is a US ATLAS operator on call 9:00 am - midnight
Shows multi-level call list.
Do calls result in tracking bugs? People should always file bug reports before calling.
Rob asks question about traceability issue with phone calls. What if someone else is interested in the problem?
====================
Day 2 Session Close Out Reports
====================
Facilities - Richard Mount
===============
How quickly will the storage needs rise?
Seems like we may need more storage than the original plan called for.
Need to look at storage ramp up.
There is a lot of monitoring software available that we need to learn about.
What about sites that are configured so that no OSG jobs will run?
We need to support OSG in addition to Panda.
We need to be a green dot on the OSG map.
Network - Rich Carlson
==============
Many milestones were set at the previous meeting, but few have been met.
Need NDT for all sites. Must be installed from a CD.
Need network diagram that includes all sites.
Implications of connecting sites - what is the infrastructure.
Storage and Data Services - Torre Wenaus
==========================
Primary choice for storage is dCache. Secondary choice is xrootd (given expertise at SLAC). Other groups looking at GPFS and Ibrix. Lustre will likely not be used.
Data Hosted at Tier 2
Full set of AODs will be stored on the Tier 2s.
Each T2 has to be able to connect to the T1 using FTS. Will test the 5 T2s with the T1 soon.
[Sidebar about dCache]
Concerns about the feasibility of dCache: scaling & manageability.
Want to tap into the SLAC expertise.
Want to provide a stable DQ2 service.
Want to get away from site internal path & position information.
We need a scheme to manage the partitioning of production data areas until SRM is available.
Sites should use an alias for the endpoint host.
We need to provide a standard US LRC.
Need to separate LRC DB service from site service DB.
Need US ATLAS standard tools for space management.
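An illustrative sketch of the kind of check a standard space-management tool could run over the partitioned data areas (the paths and quotas are hypothetical):

    # Illustrative per-area space check of the kind a standard US ATLAS
    # space-management tool might perform. Paths and quotas are hypothetical.
    import shutil

    AREAS = {
        "/data/production": 8_000,   # quota in GB
        "/data/user":       2_000,
    }

    for path, quota_gb in AREAS.items():
        usage = shutil.disk_usage(path)
        used_gb = usage.used / 1e9
        flag = "OVER QUOTA" if used_gb > quota_gb else "ok"
        print(f"{path}: {used_gb:.0f} GB used of {quota_gb} GB allowed [{flag}]")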
Management should be done at US ATLAS management level not at a local site level.
Files have been lost at Tier 2 sites.
Support data access tools for US access to data at LCG sites.
Support US ATLAS security policies and procedures for catalogs and data.
Policy and Accounting Issues - Rob Gardner
==========================
Provide current and project-planned inventory of resources (CPU, storage, ...)
Site Policy quickly got very detailed when looking at the template.
Provide a single place for policies.
Develop a coherent strategy for accounting.
Develop a RAC friendly accounting portal.
Translate RAC policies into site level implementations.
Define a multi-site quota/shares system (much of this inherent within Panda).
Deploy/implement quota policies as specified by the RAC.
Verify that policies are being implemented by Tier2 sites.
See Twiki page for list of milestone dates.
Operations and User Support - Fred Luehring
===========================
Grid middleware and ATLAS software version upgrades.
- all sites now at OSG 0.4.1; prepare for OSG 0.6.0.
- Need to test on ITB
- Need clear plan for validation
- Need plenty of warning.
- Understand the SLC4 upgrade
- Installing releases - Xin/Yuri/Tomasz -
- should start using mirrors
- how to synch multiple mirrors?
- Concern over adherence to the policy of freezing code, i.e., applying patches without incrementing the release number
- Need plan for propagating the mirrors after the
Security
- Crypto cards now required at BNL for interactive users
- Security documents needed, e.g., a security response plan
Ticketing and User Help requests
- probably need three systems
- RT, savannah, and hypernews
- Hypernews - migrate to CERN now.
- There should be a DDM operations group; this should not be handled by individual shifts at the sites
- Panda shifters should categorize the user requests
Monitoring service availability evaluation
- quantify service response
FAQs and information about the Tier2 sites
- try to make common
Support for Tier3s
- example - how to install DQ2 at Tier3 sites
Closeout - Jim
=========
Next meeting:
One day meeting, December 8, warm locale