US ATLAS Computing Integration and Operations

US/Eastern
Other Institutes

Description
http://www.usatlas.bnl.gov/twiki/bin/view/Admins/IntegrationProgram.html
    • 13:00 13:10
      Top of Meeting notes 10m
      Speakers: Michael Ernst (Unknown), Robert William Gardner Jr (University of Chicago (US))

      Attending

      • Meeting attendees: Michael, Dave, Rob, Bob, Armen, Torre, Ilija, Saul, Wei, Shawn, Alden
      • Apologies: Doug, Jason, Horst
       

      Rob

       

      • Registration open for the OSG CIC workshop at Duke University
      • See campusgrids.org

      Michael
      • Moving towards IaaS
      • Pre-GDB meeting yesterday discussing this.
      • Lots of material from the experiments
      • From ATLAS - lots of progress has been made; in full production on 16k cores.  Condor scaling issue encountered, resolved.
      • Next step is to look into the concept more in general.  At CERN, IT has invited the experiments to evaluate possibility of using IaaS without CEs or batch systems.
      • Making reasonable progress towards moving to SL6, but we're not there yet. By end of month for sure.



       
    • 13:10 13:15
      Integration points 5m
      Speaker: Robert William Gardner Jr (University of Chicago (US))
      • Review of storage deployment, SL6 migration, gLExec 5m




        The transition to SL6

        last meeting(s)

        • All sites - deploy by end of May, June
        • Shuwei validating on SL6.4; believes ready to go. BNL_PROD will be modified - in next few days 20 nodes will be converted. Then the full set of nodes.
        • Doug - will provide a link from the SIT page. Notes that prun does compilation.
        • Main thing to consider is whether you upgrade all at once, or rolling.
        • BNL will be migrated by the COB today! Will be back online tonight. BNL did the rolling update.
        • Look at AGIS - changing panda queues much easier
        • Are the new queue names handled in reporting? Yes, if they are members of the same Resource Group.
        • What about $APP? Needs a separate grid3-locations file. But the new system doesn't use it any longer.
        • Schedule:
          • BNL DONE
          • June 10 - AGLT2 - will do rolling
          • MWT2 - still a problem with validations; could start next week
          • SLAC - week of June 10
          • NET2 - all at once. Week of June 17
          • UTA - all at once. June 24. Lots of dependencies - new hardware, network. A multi-day outage is probably okay.
          • OU - all at once. Rocks versus Puppet decision. After July 5.
        • Goal: Majority of sites supporting the new client by end of June. May need to negotiate continued support

        this meeting

        • BNL DONE
        • MWT2 DONE
        • AGLT2: 1/3 of worker nodes were converted; ran into a CVMFS cache size config issue, but otherwise things are going well. The OSG app is owned by usatlas2, but validation jobs are now production jobs. Doing rolling upgrade. They are using the newest cvmfs release. n.b. change in cache location. Expect to be finished next week.
          • Fully migrated. 
        • NET2: HU first, then BU. At HU - did big bang upgrade; ready for Alessandro to do validation. Ran into problem with host cert. 2.1.11 is production. One machine at BU. Hope to have this done in two weeks. BU team working on HPC center at Holyoke.
          • HU done
          • BU: GPFS testing is complete.  Top priority. Augustine will be working on this non-stop.  
        • SWT2 (UTA)
          • End of week will be migrated. OU will be coming on Monday to go over the Rocks6.
        • SWT2 (OU)
          • Will be visiting on Monday. OSCER queues doing validation now.  
        • WT2: Failed jobs on test nodes - troubleshooting with Alessandro. Expect to be complete by end of next week.
          • Some validation jobs are failing, and also with Analysis queue
          • Writing to CVMFS?
          • No response from Alessandro. 

         

        Updates from the Tier 3 taskforce?

        last meeting
        • Report is due by July
        • Doing testing of Tier 3 scenarios using grid or cloud resources
        • Working with AGLT2 as a test queue.
        • Managed to get surveys from every Tier 3 site. Writing assignments will be set up for the final report.
        • Half the community does not have resources on their campus.
        • Need to solve the data handling problem for local resources, as a fully supported DDM endpoint. gridftp-only endpoints were never fully supported.
        • Survey report will be available in two weeks

        this meeting


         

         




        gLExec


        Claim is these instructions work:

        https://www.opensciencegrid.org/bin/view/Documentation/Release3/InstallGlexec

        We will need a volunteer to test this out and give feedback to the group.
        • AGLT2 can start looking at it in a couple of weeks
        • NET2 would like to help
        • Torre: notes discussion in ADC management, preparing statement to the effect that this would be optional for sites.


        Reviewing LHCONE connectivity for the US ATLAS Facility (Shawn)

        last meeting(s)
        • June 1 is the milestone date to get all sites on.
        • BNL DONE, AGLT2 DONE, MWT2 2 of 3 sites DONE (MWT2_IU needs action, see below).
        • SLAC DONE

        notes:

        • Updates?
        • OU - status unknown.
        • UTA - conversations with LEARN, UTA, I2 are happening. There has been a meeting. They are aware of the June 1 milestone.
        • NET2 - new 10g link is set up. 2 x 10g to HU. Chuck is aware of the June 1 LHCONE milestone. Saul will follow up shortly; expects no problem by June 1.
        • IU - plan is to decide Friday whether we need to bypass the Brocade and access the Juniper directly to peer with LHCONE. Fred is working closely with the engineers.
        • Shawn - Mike O'Conner has been putting together a document with best practices. Will have examples on how to route specific subnets that are announced on LHCONE.
        • Three configurations: 1) PBR (policy-based routing). 2) A dedicated routing instance, i.e. a virtual router for LHCONE subnets. 3) Physical routers as the gateway for LHCONE subnets.
        • NET2: have not been pushing it, but will get ball rolling again - will contact Mike O'Conner and provide feedback.
        • OU: there was a problem at MANLAN which has been fixed. Direct replacement from BNL to OU. Will start on LHCONE next.
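
The policy-based routing option discussed above can be sketched roughly as follows. This is only an illustration of the decision a PBR rule encodes, not the content of the best-practices document: the function names and the prefix lists are hypothetical, and the prefixes are reserved documentation ranges rather than real site subnets. The assumption is that traffic should use the LHCONE path only when the source is in a locally announced subnet and the destination is in a subnet announced by a remote site.

```python
import ipaddress

# Hypothetical prefixes (documentation ranges), standing in for
# subnets a site announces on LHCONE and subnets it learns from peers.
LOCAL_ANNOUNCED = [ipaddress.ip_network("192.0.2.0/24")]
REMOTE_ANNOUNCED = [
    ipaddress.ip_network("198.51.100.0/25"),
    ipaddress.ip_network("2001:db8:100::/48"),
]


def in_any(addr, networks):
    """Return True if addr falls inside any of the given networks."""
    ip = ipaddress.ip_address(addr)
    return any(ip.version == net.version and ip in net for net in networks)


def select_path(src, dst, local=LOCAL_ANNOUNCED, remote=REMOTE_ANNOUNCED):
    """Pick "lhcone" only when both ends are in announced subnets.

    Anything else falls back to the general-purpose path, which keeps
    routing symmetric for hosts that are not announced on LHCONE.
    """
    if in_any(src, local) and in_any(dst, remote):
        return "lhcone"
    return "general"


if __name__ == "__main__":
    print(select_path("192.0.2.10", "198.51.100.5"))
    print(select_path("192.0.2.10", "198.51.100.200"))
    print(select_path("203.0.113.7", "198.51.100.5"))
```

A dedicated routing instance or a physical gateway router reaches the same outcome by placing the announced subnets behind a separate forwarding table instead of matching per packet.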
        this meeting
        • Updates?
        • IU is actually on LHCONE, but not on the 100g link. Hopefully later this week.
        • OU and UTA: blocker is getting appropriate examples on how to do this.
        • NET2: no update.  
        • Shawn will get names of people who should be involved.  Will open an email.

        Storage

        • UTA storage expected to be online by Friday.



      • Compute Server Subcommittee 5m
        Speaker: Robert Ball (University of Michigan (US))
    • 13:15 13:25
      Production and Operations 10m
      Speakers: Kaushik De (University of Texas at Arlington (US)), Mark Sosebee (University of Texas at Arlington (US))
      The overriding issue is the number of transferring jobs. The backlog situation is improving. Rod Walker claims that running = 2 x transferring should be set in AGIS.

    • 13:25 13:30
      Data Management 5m
      Speaker: Armen Vartapetian (University of Texas at Arlington (US))
    • 13:30 13:35
      DDM Operations 5m
      Speaker: Hironori Ito (Brookhaven National Laboratory (US))
      Claims the backlog is caused by DDM SS not submitting jobs quickly enough. There is now a separate DDM SS just for DATADISK, and the situation is improving dramatically. Not related to FTS, network, or sites at all. Production transfers do not have higher priority, so they are not chosen. Note also that the SS are operated at CERN.
    • 13:35 13:40
      Networking and Throughput 5m
      Speaker: Dr Shawn McKee (University of Michigan ATLAS Group)
    • 13:40 13:45
      FAX 5m
      Speakers: Ilija Vukotic (University of Chicago (US)), Wei Yang (SLAC National Accelerator Laboratory (US))
    • 13:45 13:50
      Site reports: BNL 5m
      Speaker: Michael Ernst (Unknown)
    • 13:50 13:55
      Site reports: AGLT2 5m
      Speakers: Robert Ball (University of Michigan (US)), Dr Shawn McKee (University of Michigan ATLAS Group)
    • 13:55 14:00
      Site reports: MWT2 5m
      Speakers: Mr David Lesny (Univ. Illinois at Urbana-Champaign (US)), Sarah Elizabeth Williams (Indiana University (US))
    • 14:00 14:05
      Site reports: NET2 5m
      Speaker: Prof. Saul Youssef (Boston University (US))
    • 14:05 14:10
      Site reports: SWT2-OU 5m
      Speaker: Dr Horst Severini (University of Oklahoma (US))
    • 14:10 14:15
      Site reports: SWT2-UTA 5m
      Speaker: Patrick Mcguigan (University of Texas at Arlington (US))
    • 14:15 14:20
      Site reports: WT2 5m
      Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
    • 14:20 14:25
      AOB 5m