US ATLAS Tier2 Workshop at Harvard
Follow-up meeting to the US ATLAS Tier2 planning meeting in Chicago, May 9-10, 2006.
Welcome to Harvard - John Huth
==============================
Meeting logistics.

US Tier 2 Issues - Jim Shank
============================
- LHC schedule news for the DOE/NSF review on August 15; workshop goals; welcome to two new Tier 2 sites.
- 2007 is a very aggressive schedule for the LHC: first running at injection energy in Nov-Dec 2007, full 14 TeV running in 2008.
- Discussion of using Ganglia for joint monitoring.
- The new Tier 1 ramp-up has a significant reduction in the number of CPUs; the new Tier 2 ramp-up is essentially unchanged.
- The ATLAS computing timeline is unchanged for Tier 2s.
- The Resource Allocation Committee (RAC) is now formed with a full set of committee members. We now have the tools to make resource allocations to US ATLAS members.
- Projected T2 growth is not fast enough, with unspent money; need to review the T2 ramp-up numbers.
- The Tier 3 Centers Committee produced a white paper. Much discussion with the funding agencies on funding models.
- CSC11 production: OSG/ATLAS produced 24% of the ATLAS CSC production. All Tier 2 sites should be equal and saturated.
- Funding targets: may ramp the Tier 1 more slowly to meet Tier 2 needs; will try to get more of the shortfall from management reserve. Use it or lose it...
- The goal of the meeting is to fill in goals on the Twiki page.
- Tier 2 documentation: would like to clean up and rationalize the web pages.

NE Tier 2 Status - Saul Youssef
===============================
- Boston U and Harvard U collaboration.
- Usage is 75% Panda, 25% local, with trace OSG usage.
- Discussion of random number generation on 64-bit machines.
- Worries about using PBS instead of Condor.
- Wait-and-see approach to dCache vs. GPFS.
- WAN is 1 Gbps with a very fast local fiber ring, but the connection to NYC is OC48. Joining the new NewNet (the new part of Abilene), which will provide 2 x 10 Gbps.
- Outstanding issues: 1) CPU purchases - what to buy? 2) Hardware implications of the new fast network. 3) Storage architecture.
- Considerable support for GPFS from the Harvard people, but much concern about the cost of GPFS. Greg Cross is worried about some problems with GPFS, though it is able to use many systems.
HyperNews spam incident at BNL - how can we be open but not receive spam?

Midwest Tier 2 - Greg Cross
===========================
- Collaboration between UC and IU.
- Project update: a new order just placed should arrive in about a month.
- 10 Gbps is mostly in place at both sites.
- GUMS is working.
- Trying to decide between PBS and Condor - the sites will be "mirrors" of each other.
- Storage is on the individual nodes using dCache.
- The UC and IU sides will be "mirrored" - a single cluster configuration for ease of management.
- Over the next three months we will deploy our hardware.

Southwest Tier 2 - Mark Sosebee
===============================
- Collaboration between the University of Texas at Arlington and Oklahoma.
- 160 worker nodes, dual EM64T @ 3.2 GHz, on the new cluster (with IBRIX)
- UTA: 75 worker nodes on the older cluster
- UTA: 40 worker nodes, dual EM64T (with IBRIX)
- OU: 135 nodes on the older OSCER cluster (shared)
- OU: 512 nodes on the new OSCER cluster (shared)
- The current UTA link is OC12 to the Internet2 peering point in Houston; the future link will use NLR. OU is already on NLR.
- The new cluster is online at UTA.
- A large effort has been devoted to DQ2.
- Scaling is a problem with IBRIX when the number of running jobs exceeds 150; UTA is working on the problems with the IBRIX support team.
- Razvan asks about long-term support for IBRIX. Lots of discussion on IBRIX/storage.
- UTA has held two workshops (March and May) to promote physics analysis.

Great Lakes Tier 2 - Shawn McKee
================================
- Michigan and Michigan State.
- Uses Michigan LambdaRail.
- MSU has a new cluster (mainly for D0) with dual dual-core Opterons. Michigan has an older cluster that they are still using.
- The cluster is installed with ROCKS.
- Using Cacti for monitoring.
- Using OCS Next Generation to build a database of hardware characteristics.
- Makes extensive use of AFS to reduce exposure to NFS problems. Also using dCache.
- Still have some services that are not yet up (e.g. DQ2).
- There is a prototype cluster of dual dual-core Opterons (5 worker nodes + head node).
- They want guidance on needs and time scales.
- It would be simpler if their equipment could be installed in March of 2007. Both Michigan and MSU have high-quality space that will be available next spring.
- They are close to providing cycles. They should appear as a single site to outside users.
- Saul asks why they use both dCache and AFS. Answer: each node has two >100 GB disks that they want to take advantage of.
- Their long-term plan is to use Condor, but some systems are currently using PBS/Torque.

Western (SLAC) Tier 2 Status - Wei Yang
=======================================
The Western T2 was approved in July. SLAC has long experience with OSG, and they have successfully run ATLAS jobs via PandaGrid.

Management team (not charged to ATLAS):
- Richard Mount 10%
- Chuck Boeheim 15%
- Randy Melen 10%

Technical team (1 ATLAS FTE):
- Wei Yang 30-100%
- Booker Bense 20%
- Lance Nakata 20%
- Scientific and Computing Services 30%

Resources for OSG:
- 4 SUN V20z: OSG gatekeeper / GsiFTP, GUMS, VOMS
- 500 GB storage

Resources for ATLAS:
- MySQL replica for CondDB
- DQ2 site server / web proxy for Panda pilots
- 500 GB NFS for DQ2
- 10 job slots in LSF for grid users; access to LSF for local ATLAS users
- AFS space for kit and environment
- Prototype dCache

Leveraged assets:
- ~3700 CPU cores in the LSF pool (RHEL 3 & 4)
- ~30 CPU cores for interactive nodes (RHEL 3 and 4)

Challenges:
- Grid jobs overloading AFS
- Batch nodes do not have internet access (for security reasons)
- Job transformations don't all use CondDB
- Security issues with DQ2 and MySQL
- Want to use fair share
- dCache
- Using RHEL instead of SL
- Need to decide what hardware to buy
- Need to decide on a storage architecture

Plans:
- 67% of funding to storage / 33% to CPU
- Large potential for leveraged CPU
- Working on a web page

Torre asks about the leveraged resources; Richard explains how it is possible.

Research Computing at Harvard (relevant to LHC) - John Huth
===========================================================
Samples of Harvard work:
- Crimson Grid
- Initiative in Innovative Computing
- EGG project
- Another kind of challenge: the inverse mapping problem.
Crimson Grid - Joy Sincar
- Local grid using Condor
- Collaboration with GLOW (UW) / Miron Livny

Initiative in Innovative Computing - Alyssa Goodman & Tim Clark
- Fills the gap between science and computer science

EGG Project - Saul Youssef et al.
- Pacman extension
- "Market place" driven resource allocation

LHC Inverse Mapping Problem - N. Arakani-xxxxx and G. Kane
- How does real data map back into theoretical parameter space? Try a systematic approach.

US ATLAS Facilities - Razvan Popescu
====================================
What is the baseline set of services that we need?

Functions provided by Tier 2 centers:
- Provide computing resources
- Flexible means of authentication
- Flexible resource allocation
- Provide tech support
- Provide monitoring and usage accounting
- No interactive work, no direct user support

CPU resources are provided via grid access. CPU resource utilization requires allocation policy services (including service agreement policies). BNL is using Condor (it dropped LSF), which requires one full-time FTE for support.

Storage - two types:
1) Flexible high performance (NFS, IBRIX, ...)
2) Distributed low cost (dCache)

Access control:
- Provide multi-CA authentication
- Provide an audit trail

Long discussion on how to provide the audit trail in a grid environment. The labs cannot use group accounts (security).

Monitoring and accounting: need central information collection.

Tech support: need to understand the role of the Tier 2s.

Capacity profile: doing well on CPU and somewhat behind on storage. Need to make adjustments to the model.

CPU and storage access: OSG gatekeepers, OSG GridFTP, and SRM.

CPU allocation: obviously use a queuing system, but we also need service agreements and/or hard/soft allocations.
Storage and data management: provide a "global" FS and low-cost storage. We need quotas even in dCache.

Authentication and authorization services:
- GUMS
- VOMS
- Dynamic mapping

Monitoring and accounting:
- Ganglia
- Nagios
- MonALISA
- OSG accounting

Support:
- Must define the responsibility chain & call-out procedures
- Need contacts
- RT system into production (RHIC has converted almost completely)
- Documentation - must still be written

An FTS/ToA Model - Dan Shrager
==============================
Worked on the DQ2 install with help from Horst and Patrick.

DQ2 logical high level - unchanged: implements the current DDM policies in ToA. This level defines the data path logic.

DQ2 low level - physical transport: implemented in FTS and ToA at a low level, intended to eliminate road blocks. Free access with a certificate.

FTS model: generates a list of serviced endpoints, NOT bi-directional channels. One (and only one) rule: any transfer request, regardless of its direction, is serviced as long as one endpoint is affiliated.

ToA format: the ToA reverse format is defined by reference to the FTS direct list. Each site is tagged with the preferred FTS (e.g. BNLFTS or NOFTS). The ftsTopology section is eliminated as redundant or incomplete.

Features of this data model:
- Full load control for each endpoint (as opposed to channel load control)
- Good partitioning at the Tier borders - requests with both endpoints foreign are denied
- Default transfer priority: internal transfers take precedence over ones with one foreign endpoint
- Simplicity: N endpoints vs. N^2 channels
- Use of the reverse format
- Default criteria for dataset location can be based on FTS affiliation and/or capability

US ATLAS Tier 2 configuration: the current FTS/ToA implementation supports a free DDM model (the advantage is that you can take data from the closest location). Inter-T2 transfers become possible and are welcome for better data distribution. An LRC in LFC format can be added to a site's DQ2 installation. European datasets become available on demand. See the shell script in the talk.
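The single FTS rule described above can be pictured as a small decision function. This is a sketch under stated assumptions: the site names and the interface are invented for illustration, and the real logic lives in the FTS/ToA configuration, not in code like this.

```python
# Sketch of the one-rule FTS model: serve a request if at least one
# endpoint is affiliated; deny if both are foreign; prefer internal
# transfers. The affiliated-site list below is purely illustrative.

AFFILIATED = {"BNL", "BU_HU", "UC_IU", "UTA_OU", "UM_MSU", "SLAC"}

def service_decision(src: str, dst: str) -> str:
    """Classify a transfer request as 'internal', 'mixed', or 'deny'."""
    src_local, dst_local = src in AFFILIATED, dst in AFFILIATED
    if src_local and dst_local:
        return "internal"   # highest default priority
    if src_local or dst_local:
        return "mixed"      # serviced, but after internal transfers
    return "deny"           # both endpoints foreign: blocked at the Tier border
```

Note how the N-endpoint list replaces the N^2 channel table: the decision needs only per-site affiliation, not a per-pair channel entry.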
A pair of catch-all channels is required in the FTS and ToA configuration.

Conclusions about the possible new model:
- DQ2 implementers - simpler, comprehensive DDM policy enforcement, site load control
- US ATLAS Tier 2 sites - unrestricted access
- Physicists - free access to data

Discussion:
- Kaushik says the current design is based on early DQ2 version capabilities.
- Dantong points out that FTS can have an advantage from knowing the closest servers.
- Dantong points out that with IU & UC being closely linked, the other site should be the first choice for data if the needed file is there.
- Dantong asks what happens if a European site wants data that is on a US Tier 2.
- Torre: there is a very important policy issue here.
- Richard Mount: either you keep a strict hierarchical Tier 0/1/2/3... model or you need something quite sophisticated.
- Razvan: are these all routing problems? How is this implemented? How can data be routed at the lowest level?
- Dan: what happens if BNL is down?

Proposal from Alexei:
- Implement Dan's model for a month, and provide feedback at the DDM meeting at BNL.
- Provide allocation controls at the site level (to manage load), e.g. the number of simultaneous transfers.

ATLAS Software Infrastructure - Fred Luehring
=============================================
Overview of the SIT and its responsibilities. US SIT people: it is largely a European effort, but a significant number of US people are involved.

Hot issues of the SIT:
- Release schedule
- New Linux/compiler versions
- New platforms to support
- New support people to clean up CVS
- Geometry/conditions database release distribution
- Kit improvements

Release schedule:
- Controlled by the SPMB/CMB; the build is done by the SIT
- Sep 15: simulation with 12.0.3 ...
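Alexei's suggested site-level allocation control could be as simple as a cap on simultaneous transfers. A minimal sketch, with an invented interface (a real control would live in the DQ2 site services or the FTS channel settings):

```python
# Minimal sketch of a per-site cap on simultaneous transfers; the class
# and method names are assumptions made for this illustration.

class TransferSlots:
    def __init__(self, max_concurrent: int):
        self.max_concurrent = max_concurrent
        self.active = 0

    def try_start(self) -> bool:
        """Admit a new transfer only if a slot is free."""
        if self.active < self.max_concurrent:
            self.active += 1
            return True
        return False

    def finish(self) -> None:
        """Release a slot when a transfer completes."""
        self.active = max(0, self.active - 1)
```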
New Linux:
- CERN expects SLC4 by the end of October
- The SIT is running nightlies for SLC4
- SLC5 will be too late for LHC turn-on
- SLC4 includes gcc 3.4.5

32 vs. 64 bit:
- Nightlies are running that test native 64-bit versions
- The goal is to have 64-bit mode validated by the end of 2006
- This is for AMD64/EM64T - no plan to use Mac for production

New SIT people at CERN:
- Four people with various roles starting now at CERN
- All will contribute to user support; some will work on documentation

Cleaning up CVS:
- Either a new instance or a new directory tree, to include only active packages
- Subversion is not accepted now, perhaps in the future

Conditions database versioning:
- It became apparent in the last few months that this is needed
- The tag collector team is looking at ways to tag database versions

ATLAS data security:
- Worried about unauthorized use of ATLAS data and code
- There is a team investigating this now
- There may be implications for grid data storage

Kit issues:
- A kit for every nightly, to run RTT tests on the grid and reduce problems caused by AFS
- One pacman file per project, not per package, saving much time
- Pacman mirrors

Installing the ATLAS offline SW:
- How? Pacman mirrors
- Currently using the central CERN cache
- Suggest mirrors with hourly updates via a crontab job at every site
- Issue: how to trace which mirror was used
- Run KitValidation

More on mirrors:
- Reduced install time
- Multiple releases in the same directory tree

Conclusions:
- Subscribe to the HyperNews discussions for Releases and Distribution Kit and for Offline Announcements
- Report all problems to Savannah!!

Panda - Kaushik De
==================
Panda development:
- Robust local pilots
- DQ2 enhancements

Local pilot submission: Xin is working on this.

Robust pilot job submission: originally developed and tested at UTA by Patrick, now in use at BNL. Marco is continuing parallel development of the CondorG scheduler.

Multi-tasking pilots (Paul Nilsson): the primary goal is to add the ability to inject short analysis jobs into a CPU that is busy with a long job.
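The multi-tasking pilot goal above amounts to a scheduling decision. A toy sketch of the admission test follows; the function name, units, and safety margin are all assumptions for illustration, not Panda's actual logic:

```python
# Toy admission test for injecting a short analysis job alongside a long
# job: inject only if the short job comfortably fits the window the long
# job is expected to leave the CPU underused. Numbers are illustrative.

def can_inject(short_job_min: float, idle_window_min: float,
               safety_factor: float = 1.5) -> bool:
    """Inject a short job only if it fits the idle window with margin."""
    return short_job_min * safety_factor <= idle_window_min
```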
Accounting and quota systems: new development by Sudhamsh Reddy and Torre.

Data service for non-DQ2 sites: new work by Karthik, Horst, and Marco. The work is to enable the use of sites that do not have DQ2.

Panda usage model: there are many types of users, and Panda must deal with them all. A number of cases:
- Managed ATLAS production
- Regionally submitted jobs
- Individual users' submitted jobs
- Others???

Discussion on the security of the grid. Richard Mount is worried that our high level of visibility makes the HEP grids a target.

Screen shots of Panda monitoring:
- US production done with Panda
- Panda user monitoring: a display of which users submitted Panda jobs
- Panda usage accounting: tracks who is using what

Conclusions: Alexei Klimentov has been appointed DDM coordinator.

OSG Status and 0.6.0 - Rob Gardner
==================================
Much of this talk was taken from talks given at OSG/EGEE and WLCG meetings.

OSG service stack: NMI, VDT, OSG release.

OSG service overview:
- Compute elements: GRAM, GridFTP, GIP
- Storage elements: SRM-drm, SRM-dCache
- Site-level services: GUMS
- VO-level services: VOMS
- VO edge services: multi-VO

OSG timeline: see the talk for a diagram. ITB 0.5.0 is under development; OSG 0.6.0 is under development.

Current OSG release description:
- VDT 1.3.10
- Privilege infrastructure
- GT4 GridFTP
- GT4 pre-web services
- Slide showing the contents of VDT 1.3.10

Privilege authorization services: site-level services to support role-based access to Tier 2 resources. Receives updates on mappings from VOMS; a reverse map is created for accounting.

Authorization process: the GUMS server can be a single point of failure.

ATLAS is steadily consuming OSG resources.
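Role-based mapping of the kind GUMS performs can be pictured as a lookup from (VO, role) to a local account. The mapping table and names below are invented for illustration; a real GUMS deployment drives this from the VOMS membership data mentioned above.

```python
# Illustrative (VO, role) -> local account lookup, roughly the kind of
# mapping GUMS maintains; the accounts and roles here are made up.

ROLE_MAP = {
    ("atlas", "production"): "atlasprod",
    ("atlas", "software"):   "atlassoft",
    ("atlas", None):         "atlasuser",   # default for plain VO members
}

def map_account(vo: str, role=None):
    """Return the local account for a VO/role pair, falling back to the VO default."""
    return ROLE_MAP.get((vo, role)) or ROLE_MAP.get((vo, None))
```

Inverting this table is what produces the "reverse map" used for accounting: given a local account, recover which VO and role consumed the resources.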
Middleware release roadmap:
- OSG 0.6.0, Fall 2006: accounting, Squid, SRM V2 + AuthZ, CEMon-ClassAd, support for MDS-4, possible requirement to use WS-GRAM, Edge Services Framework, gLexec
- OSG 0.8.0, Spring 2007: just-in-time scheduling, pull-mode Condor-C, support for sites to run pilot jobs and/or glide-ins
- OSG 1.0.0, end of 2007

Edge Services Framework:
- The goal is to support deployment of VO services
- Based on XEN virtual machines and service images: the site supports a XEN server and the VO loads images

Accounting:
- The system is based on probes and collectors
- Has publishers for reports and querying

Storage authorization: developed at SDSC (a high priority for CMS also).

Resource selection.

ATLAS requirements:
- Opportunistic access to non-ATLAS sites without DQ2
  o Uberftp on computing nodes
- Improving the DQ2 installation
  o glite-*-*
  o CGSI_gSOAP
  o edg-gridftp-client-1.2.5-1

Patrick worries that DQ2 is evolving too fast for these pieces of software to stay useful. Rob points out that it may help other VOs interested in LCG interoperability.

ATLAS wish list:
- Accounting - can the OSG service be used by the RAC????
  o Test with the USCMS T2
  o Test in the ITB
  o Compare with Panda-collected statistics
  o Need to collect and report non-Panda usage of resources
- Managed storage within SRM/dCache SEs (right now "wild west" storage)
- ...

More information on what's coming:
- See the ITB release description
- The ATLAS BO page in OSG
- Next week's OSG consortium meeting (for details please see the URLs in the slides)

Dan asks about private network pools. Rob answers that we can bring this up.

Ultra Light Update - Shawn McKee
================================
Status update:
1) New kernel
2) Kernel development

Found that a tuned kernel was needed.

Light path plans: new light path technologies are emerging. VINCI is being developed by Iosif Legrand. Shows a demo of dynamic light path building.
LISA, EVO and end hosts: VRVS is being replaced by EVO (Enabling Virtual Organizations), a merger of MonALISA and VRVS.

FTS and UltraLight: work to improve communication between the physics people and the networking/data people.

Shows a number of slides from Harvey Newman. Please see the slides for details - this talk went by very quickly.

TeraPaths Project Team - Dantong Yu
===================================
TeraPaths is complementary to UltraLight: enhanced network functionality with QoS features.

QoS tools:
- IntServ
  o RSVP
- DiffServ
- ...

TeraPaths projects - new development: going from last-mile to end-to-end QoS, using web services. Site bandwidth partitioning scheme.

Acquired experience:
- Enabled and tested LAN QoS inside the BNL campus network
- Tested and verified MPLS between BNL and Michigan

In progress & future work:
- Add GUMS
- Develop a site-level network resource manager
- Support dynamic bandwidth/routing adjustments

User Support - Dantong Yu
=========================
Shows the current web support scheme. They are phasing out the home-grown CTS ticket system for RT. Evaluating the RT system - so far so good.

There is a US ATLAS operator on call from 9:00 am to midnight. Shows a multi-level call list.

Do calls result in tracked bugs? People should always file bug reports before calling. Rob asks about the traceability issue with phone calls: what if someone else is interested in the problem?

====================
Day 2 Session Close Out Reports
====================

Facilities - Richard Mount
==========================
How quickly will the storage needs rise? It seems we may need more storage than the original plan called for; need to look at the storage ramp-up. There is a lot of monitoring software available that we need to learn about. What about sites that are configured so that no OSG jobs will run? We need to support OSG in addition to Panda. We need to be a green dot on the OSG map.

Network - Rich Carlson
======================
Many milestones were set at the previous meeting, but few have been met. Need NDT for all sites.
It must be installable from a CD. Need a network diagram that includes all sites. Implications of connecting sites - what is the infrastructure?

Storage and Data Services - Torre Wenaus
========================================
- The primary choice for storage is dCache; the secondary choice is xrootd (given the expertise at SLAC). Other groups are looking at GPFS and IBRIX. Lustre will likely not be used.
- Data hosted at Tier 2: the full set of AODs will be stored on the Tier 2s.
- Each T2 has to be able to connect to the T1 using FTS. Will test the 5 T2s with the T1 soon.
- [Sidebar about dCache] Concerns about the feasibility of dCache: scaling & manageability. Want to tap into the SLAC expertise.
- Want to provide a stable DQ2 service.
- Want to get away from site-internal path & position information.
- We need a scheme to manage partitioning of the production data areas until SRM is available.
- Sites should use an alias for the endpoint host.
- We need to provide a standard US LRC. Need to separate the LRC DB service from the site service DB.
- Need US ATLAS standard tools for space management. Management should be done at the US ATLAS management level, not at the local site level.
- Files have been lost at Tier 2 sites.
- Support data access tools for US access to data at LCG sites.
- Support US ATLAS security policies and procedures for catalogs and data.

Policy and Accounting Issues - Rob Gardner
==========================================
- Provide a current and projected/planned inventory of resources (CPU, storage, ...).
- Site policy quickly got very detailed when looking at the template. Provide a single place for policies.
- Develop a coherent strategy for accounting. Develop a RAC-friendly accounting portal.
- Translate RAC policies into site-level implementations.
- Define a multi-site quota/shares system (much of this is inherent within Panda).
- Deploy/implement quota policies as specified by the RAC.
- Verify that policies are being implemented by the Tier 2 sites.
- See the Twiki page for the list of milestone dates.
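A multi-site quota/shares system of the kind listed above ultimately reduces to an admission check against per-group allocations. A toy sketch, with invented structures and units (this is not an agreed design, just an illustration of the check the RAC policies would drive):

```python
# Toy quota admission check for the multi-site quota/shares idea; group
# names, TB units, and the dict-based interface are all assumptions.

def admit(group: str, request_tb: float, usage: dict, quotas: dict) -> bool:
    """Charge the group and admit the request only if it fits the quota."""
    used = usage.get(group, 0.0)
    if used + request_tb <= quotas.get(group, 0.0):
        usage[group] = used + request_tb
        return True
    return False
```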
Operations and User Support - Fred Luehring
===========================================
Grid middleware and ATLAS software version upgrades:
- All sites are now at OSG 0.4.1; prepare for OSG 0.6.0.
- Need to test on the ITB.
- Need a clear plan for validation.
- Need plenty of warning.
- Understand the SLC4 upgrade.

Installing releases - Xin/Yuri/Tomasz:
- Should start using mirrors - how to synch multiple mirrors?
- Concern over adherence to the policy of freezing code, i.e. applying patches without incrementing the release number.
- Need a plan for propagating the mirrors after the

Security:
- Crypto cards are now required at BNL for interactive users.
- Security documents are needed, e.g. a security response plan.

Ticketing and user help requests:
- Probably need three systems: RT, Savannah, and HyperNews.
- HyperNews: migrate to CERN now.
- There should be a DDM operations group, not handling by individual shifts at the site.
- Panda shifters should categorize the user requests.

Monitoring service availability evaluation: quantify the service response.

FAQs and information about the Tier 2 sites: try to make them common.

Support for Tier 3s - for example, how to install DQ2 at Tier 3 sites.

Closeout - Jim
==============
Next meeting: a one-day meeting, December 8, in a warm locale.