US ATLAS Computing Integration and Operations

US/Eastern
virtual room (your office)

Description
Notes and other material available in the US ATLAS Integration Program Twiki
    • 13:00 13:15
      Top of the Meeting 15m
      Speaker: Robert William Gardner Jr (University of Chicago (US))
      • OSG AHM reminder: https://indico.fnal.gov/conferenceDisplay.py?confId=10571 , and US ATLAS Facilities meeting, March 14-17, 2016, Clemson University.
      • ESnet LHCONE site coordinator meetings

        • OU - Friday, Feb 12, 12pm CST (done)
        • IU - Thursday, Feb 11, 9am CST (done)
        • UTA - Wednesday, Feb 17, 2pm CST (today)
        • BU - being discussed - NOX versus MIT at MANLAN; will check status bi-weekly
        • Duke - likely this Friday
        • UT/TACC - TBD, pending LEARN-ESnet peering

       

      • OSG 3.3 upgrade discussion. OSG 3.2 will be deprecated.

    • 13:15 13:25
      Capacity News: Procurements & Retirements 10m
    • 13:25 13:35
      Production 10m
      Speaker: Mark Sosebee (University of Texas at Arlington (US))
    • 13:35 13:40
      Data Management 5m
      Speaker: Armen Vartapetian (University of Texas at Arlington (US))
    • 13:40 13:45
      Data transfers 5m
      Speaker: Hironori Ito (Brookhaven National Laboratory (US))
    • 13:45 13:50
      Networks 5m
      Speaker: Dr Shawn McKee (University of Michigan ATLAS Group)
    • 13:50 13:55
      FAX and Xrootd Caching 5m
      Speakers: Andrew Bohdan Hanushevsky (SLAC National Accelerator Laboratory (US)), Ilija Vukotic (University of Chicago (US)), Wei Yang (SLAC National Accelerator Laboratory (US))
    • 14:15 15:15
      Site Reports
      • 14:15
        BNL 5m
        Speaker: Michael Ernst

        Smooth operations for the last 2 weeks

        • Running at capacity, mostly MCORE (production) jobs
          • Heavily dominated by reprocessing jobs (2015 data) for the last ~10 days
        • Disk storage is tight; free space is <1 PB

        Preparing for the AWS 100k-core scale test with the Event Service, which is scheduled for next week

        min/max RSS implemented according to ADC request

      • 14:20
        AGLT2 5m
        Speakers: Robert Ball (University of Michigan (US)), Dr Shawn McKee (University of Michigan ATLAS Group)

        No news is good news?  We are running well. 

        Preparation is ongoing for a full software update at AGLT2. This will include the glibc fixes, which should appear in our rpm repos overnight tonight. The osg-wn-client will be 3.3.8 (but may move to 3.3.9), cvmfs will be 2.1.20, and HTCondor will be 8.4.3. osg-ce work on our test gatekeeper is ongoing today, using the current 3.3.9 version.

        The dcap and lcg-util rpms, which are still needed by our lsm* utilities, were taken from EPEL.

        Next week we are likely to upgrade dCache from the 2.10 series to the 2.13 series.

        Note that the OSG rpm suites require Java 1.7, but if Java 1.8 is also installed and set as the default, the OSG software still runs fine, so we will make Java 1.8 the default everywhere. See OSG ticket https://ticket.opensciencegrid.org/28484

         

      • 14:25
        MWT2 5m
        Speakers: David Lesny (Univ. Illinois at Urbana-Champaign (US)), Lincoln Bryant (University of Chicago (US))

        Site has been running well

        • Full of ATLAS jobs (MCORE, SCORE, Analy and Opportunistic)
        • Good efficiency

        Illinois down for preventive maintenance (PM) on the campus cluster

        Updated glibc pushed to all nodes

        New Disk at UChicago

        • Ceph based
        • Migrating LOCALGROUPDISK
        • Lincoln is using gfal-copy and SRM to copy from dCache to Ceph, but it is slow
        • 193 TB of 368 TB migrated so far
        • Need kernel 4.4 to fix controller problems

         

        OSG 3.3.9

        • All head nodes have been using 3.3.x stack for a long time without problems
          • CE (HTCondorCE)
          • Squid
          • CVMFS servers/clients
          • GUMS
          • Condor 8.4.3
        • Still using 3.2.35 on worker nodes
          • Testing new LSM
          • Tries GFAL2 via xrootd, then SRM, then FAX to stage in the file
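        The stage-in order noted for the new LSM (xrootd first, then SRM, then FAX) is an ordered fallback chain. A minimal Python sketch of that pattern follows. It is not the actual LSM code: the function names, the file names, and the stand-in transfer callables are all hypothetical and exist only to show the control flow.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical transfer callable: takes (source, destination), raises on failure.
Transfer = Callable[[str, str], None]


def stage_in(source: str, dest: str,
             methods: Sequence[Tuple[str, Transfer]]) -> str:
    """Try each method in order; return the name of the first that succeeds."""
    errors: List[str] = []
    for name, transfer in methods:
        try:
            transfer(source, dest)
            return name
        except Exception as exc:  # keep the reason, then fall through to the next method
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all stage-in methods failed: " + "; ".join(errors))


# Stand-ins for illustration only; a real LSM would call the transfer tools here.
def via_xrootd(src: str, dst: str) -> None:
    raise IOError("xrootd door unavailable")  # simulate a failure


def via_srm(src: str, dst: str) -> None:
    pass  # simulate a successful copy


def via_fax(src: str, dst: str) -> None:
    pass  # simulate a successful copy


used = stage_in("input.file", "local.file",
                [("xrootd", via_xrootd), ("srm", via_srm), ("fax", via_fax)])
print(used)
```

        In this sketch the simulated xrootd failure causes the copy to fall through to SRM; FAX is only reached if both earlier methods fail.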

         

        minRSS and maxRSS now set

        • MCORE needs 24GB for reprocessing jobs
        • Changed HTCondorCE to request RSS of 24GB (previously 16GB)
        • Many nodes have only 2 GB/core, so this can leave cores idle for lack of free memory
        • Might create MWT2_MCORE_HIMEM to handle jobs needing > 2 GB per core
          • Route these jobs only to nodes with 3 GB or more per core
          • MWT2 has almost 4000 cores that meet this criterion
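        The idle-core effect described above is simple arithmetic. The Python sketch below illustrates it under assumptions that are not in these notes: an MCORE job is taken to use 8 cores, and the 24-core node is a hypothetical example. Only the 16 GB and 24 GB RSS requests and the 2 GB and 3 GB per-core figures come from the report.

```python
def usable_cores(node_cores, mem_per_core_gb, job_cores, job_rss_gb):
    """Return (cores_used, cores_idle) when a node runs only identical MCORE jobs."""
    node_mem_gb = node_cores * mem_per_core_gb
    jobs_by_cores = node_cores // job_cores        # limit set by core count
    jobs_by_mem = int(node_mem_gb // job_rss_gb)   # limit set by memory
    jobs = min(jobs_by_cores, jobs_by_mem)
    used = jobs * job_cores
    return used, node_cores - used


# Assumed: 8-core MCORE jobs on a hypothetical 24-core node.
print(usable_cores(24, 2, 8, 16))  # 2 GB/core, old 16 GB request
print(usable_cores(24, 2, 8, 24))  # 2 GB/core, new 24 GB request
print(usable_cores(24, 3, 8, 24))  # 3 GB/core, new 24 GB request
```

        With these assumed numbers, the 2 GB/core node fills completely at a 16 GB request but leaves 8 cores idle at 24 GB, while the 3 GB/core node still fills completely, which is the motivation for a separate high-memory queue.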

         

      • 14:30
        NET2 5m
        Speaker: Prof. Saul Youssef (Boston University (US))

        We have had smooth operations in the past two weeks, with only one brief SRM incident. We are running lots of reprocessing (maxrss is already 24000 for our MCORE queues). The main things we are working on are:

        o Bringing the new storage online and into GPFS

        o Transitioning to HTCondor-CE on the BU and HU side

        o Re-formulating our WAN update plan

        o Preparing to join LHCONE

        o Working with MOC to add a pool of MOC worker nodes

        HU availability reporting via SAM is also broken, and the cause still has to be tracked down.

         

      • 14:35
        SWT2-OU 5m
        Speaker: Dr Horst Severini (University of Oklahoma (US))

        - all sites running well

        - updated maxrss in AGIS for all sites

        - Lucille had a downtime on Monday for OS and OSG (3.3) updates

        - hopefully more news about OSG (3.3) testing of the new OSCER cluster soon

        - migration of OCHEP cluster from Internet2 to ESnet LHCONE connection under way

         

      • 14:40
        SWT2-UTA 5m
        Speaker: Patrick Mcguigan (University of Texas at Arlington (US))
      • 14:45
        WT2 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
    • 15:15 15:20
      AOB 5m