US ATLAS Computing Integration and Operations

US/Eastern
virtual room (your office)


Description
Notes and other material available in the US ATLAS Integration Program Twiki
    • 1
      Top of the Meeting
      Speaker: Kaushik De (University of Texas at Arlington (US))

      Present: Michael, Dave, Fred, Kaushik, Bob, Saul, Armen, Mark, Alden, Horst

      Apologies: Rob

      From Rob:

      1) There is a nice summary of the sites jamboree from Ale  https://indico.cern.ch/event/440821/

      2) More generally, there were interesting discussions at the WLCG meeting,  https://indico.cern.ch/event/433164/other-view?view=standard and summaries are being posted.

      3) There is a cloud-level action item from Alessandra about publishing maxrss values, agreed at the jamboree.

      Other items:

      Reminder OSG AHM at Clemson Mar 14-17: https://indico.fnal.gov/conferenceDisplay.py?confId=10571

      Latest ADC weekly meeting for general ATLAS info: https://indico.cern.ch/event/469716/

      Michael: The OSG Technology Area has requested a move to OSG 3.3. Bob and Dave tried this version and found an issue with dccp, which Xin reported at the OSG ops meeting. The OSG team will look into it, and a reply from Bockelman is expected soon. Other US sites are encouraged to test this version, since 3.2 will go away in the near future.

      Saul: NET2 ran into a problem with SLURM while upgrading to the HTCondor CE (including the new OSG release) and filed a ticket.

    • 2
      Capacity News: Procurements & Retirements
    • 3
      Production
      Speaker: Mark Sosebee (University of Texas at Arlington (US))
    • 4
      Data Management
      Speaker: Armen Vartapetian (University of Texas at Arlington (US))
    • 5
      Data transfers
      Speaker: Hironori Ito (Brookhaven National Laboratory (US))
    • 6
      Networks
      Speaker: Dr Shawn McKee (University of Michigan ATLAS Group)
    • 7
      FAX and Xrootd Caching
      Speakers: Andrew Bohdan Hanushevsky (SLAC National Accelerator Laboratory (US)), Ilija Vukotic (University of Chicago (US)), Wei Yang (SLAC National Accelerator Laboratory (US))
    • Site Reports
      • 8
        BNL
        Speaker: Michael Ernst
        • OSG Technology Area is asking sites to transition to OSG SW release 3.3.x a.s.a.p. as they want to drop support for OSG 3.2 by ~August
          • I sent a corresponding announcement to the US ATLAS T2 list and received feedback from AGLT2 and MWT2 indicating a lack of support for dccp in release 3.3
          • Xin brought this up at yesterday's OSG production operations meeting. The OSG SW team is looking into the issue. US ATLAS Facilities expect a response from Brian Bockelman (who heads the OSG Technology Area)
        • Smooth operation of the Tier-1 center over the course of the last 2 weeks, utilization of CPU at capacity
          • Reprocessing of 2012 data running since late last week on T1s and T2s worldwide
          • The BNL T1 leads all sites by a wide margin with an overall contribution of 33%, followed by RAL (9.5%) and SIGNET (8%). This puts a lot of stress on the tape system; staging performance has been excellent, with up to 50 TB (60k RAW data files) retrieved from tape in 24 h: http://dashb-atlas-job.cern.ch/dashboard/request.py/dailysummary#button=cpuconsumption&sites[]=All+T21&sitesCat[]=All+Countries&activities[]=reprocessing&resourcetype=All&sitesSort=5&sitesCatSort=0&start=2016-02-01&end=2016-02-03&timerange=daily&granularity=Hourly&generic=0&sortby=11&series=All
        • At the T1 we are in the process of implementing the memory configuration as requested by ADC at the WLCG collaboration meeting
          • We've made the following changes in AGIS:
            queue name              maxrss (GB)   minrss (GB)
            BNL_ATLAS_2             8             2.5
            BNL_PROD                5             0
            BNL_PROD_MCORE          24            0
            BNL_PROD_MCOREHIMEM     64            24
            ANALY_XX queues         3             0
        • The first part of FY16 disk storage procurement has arrived, ~2.3 PB of usable disk space.
        • Article about BNL/ATLAS AWS cloud work in InformationWeek at http://www.informationweek.com/cloud/infrastructure-as-a-service/brookhaven-lab-finds-aws-spot-instances-hit-sweet-spot/d/d-id/1324145
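
        For orientation, the peak tape staging figure quoted above (up to 50 TB in 60k RAW files within 24 h) can be turned into average rates. This is only a back-of-the-envelope sketch: it assumes decimal units (1 TB = 1e12 bytes) and a uniform rate over the 24 h window, neither of which is stated in the report.

```python
# Figures quoted in the BNL report: up to 50 TB / 60k RAW files in 24 h.
# Assumption: decimal units (1 TB = 1e12 bytes); the minutes do not say.
STAGED_BYTES = 50e12
STAGED_FILES = 60_000
WINDOW_SECONDS = 24 * 3600


def staging_summary(total_bytes, n_files, seconds):
    """Return (mean throughput in MB/s, mean file size in MB, files per second)."""
    throughput_mb_s = total_bytes / seconds / 1e6
    mean_file_mb = total_bytes / n_files / 1e6
    files_per_s = n_files / seconds
    return throughput_mb_s, mean_file_mb, files_per_s


throughput, file_size, file_rate = staging_summary(
    STAGED_BYTES, STAGED_FILES, WINDOW_SECONDS
)
print(f"{throughput:.0f} MB/s, {file_size:.0f} MB/file, {file_rate:.2f} files/s")
# prints: 579 MB/s, 833 MB/file, 0.69 files/s
```

        Under these assumptions the tape system sustained roughly 0.6 GB/s on average at peak, with a mean RAW file size a little under 1 GB.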
      • 9
        AGLT2
        Speakers: Robert Ball (University of Michigan (US)), Dr Shawn McKee (University of Michigan ATLAS Group)

        The dCache upgrade worked without a hitch; unfortunately, dCache itself did not.  We upgraded from version 2.10.42-1 to 2.10.51-1.  In the end, we rolled back to 2.10.42-1 and the OOM conditions went away.  The problem was introduced at 2.10.49-1 in the third-party Netty library.  To quote Gerd Behrmann, "The bug causes problems with throttling file loading when a slow HTTP client reads the file. Due to this the data gets queued in memory and for large files you will quickly run out of heap memory."  Netty was downgraded and dCache 2.10.52-1 has now been released.  However, we will not do that upgrade now, choosing instead to move to dCache 2.13 sometime towards the end of February.

        All MSU R630 are now running jobs in Condor.  This is the last of the 2015 funds, and the v38 spreadsheet has been updated accordingly.

        It is no longer necessary (and perhaps has not been for some time) to notify the OSG GOC that a change in the APEL normalization factor has been made.  As long as the value in the resource is updated in OIM, it should propagate correctly into the WLCG reporting within 24 hours.  If it is NOT seen to update, then it should be reported.

        The flapping NIC on msufs02 was fixed by re-seating the SFP+ cable at the EX4500 switch end.  However, some problems continue with the switch ports of this unit.  It will be updated to a newer software version sometime soon.

        The new AGIS parameters minrss and maxrss were updated this morning, taking on reasonable values (we hope) for our site.

        Over the next month we will prepare to update our site software; the update will probably take place towards the end of February.  This may require a site downtime of a day or two, but we are considering doing a rolling update over all WN.  Software to be updated includes OSG-CE, OSG-WN, cvmfs, HTCondor, and other small pieces as needed.

         

      • 10
        MWT2
        Speakers: David Lesny (Univ. Illinois at Urbana-Champaign (US)), Lincoln Bryant (University of Chicago (US))

        Site is running well

        • Full of ATLAS jobs (MCORE, SCORE, Analy and Opportunistic)
        • Good efficiency

        IU nodes now operational

        • Over 13500 cores
        • HS 133,303, APEL Factor of 9.87
        • Accounting updated in OIM, REBUS, V38 and WLCG-V38
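
        As a consistency check on the numbers above, the APEL factor can be read as benchmark capacity per core. The sketch below assumes that "HS" means total HS06 for the site and that the APEL factor is HS06 per core; the report gives only "over 13500" cores, so the exact core count is inferred rather than quoted.

```python
# Values quoted in the MWT2 report.
TOTAL_HS06 = 133_303   # "HS 133,303" (assumed to be total HS06)
APEL_FACTOR = 9.87     # assumed to be HS06 per core
QUOTED_CORES = 13_500  # the report says "over 13500"


def hs06_per_core(total_hs06, cores):
    """Average benchmark score per core."""
    return total_hs06 / cores


def implied_cores(total_hs06, factor):
    """Core count implied by a total score and a per-core factor."""
    return total_hs06 / factor


print(round(hs06_per_core(TOTAL_HS06, QUOTED_CORES), 2))  # 9.87
print(round(implied_cores(TOTAL_HS06, APEL_FACTOR)))      # 13506
```

        The implied count of about 13506 cores agrees with "over 13500", so the three quoted figures are mutually consistent under these assumptions.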

        New Disk at UChicago

        • Ceph based
        • Will move MWT2_UC_LOCALGROUPDISK from dCache to Ceph
        • Might change name to MWT2_LOCALGROUPDISK to aid in transition
        • Lincoln is testing by using gfal-copy to push data into the system

        OSG 3.3.8

        • All head nodes have been using 3.3.x stack for a long time without problems
          • CE (HTCondorCE)
          • Squid
          • CVMFS servers/clients
          • GUMS
          • Condor 8.4.3
        • Still using 3.2.24 on worker nodes
          • DCAP is removed in 3.3, but we still use it in LSM
          • Working on an LSM update to remove the DCAP dependency
          • Will use GFAL2 (gfal-copy, gfal-rm, gfal-ls)

        Virtual Memory issues

        • Large jobs causing many problems
          • OOM killing other jobs
          • Nodes hanging/crashing
          • lostheartbeat
        • Upgrade to HTCondor 8.4.3 and cgroups help control large jobs
          • cgroup "soft" allows flexible RSS
          • hard virtual memory limit puts jobs into HELD
        • Exposed inconsistent swapfile policy (little to no swap on some nodes)

        FAX Door issues

        • Doors at IU were causing problems
        • Some type of internal IU networking issue (internal low level packet loss)
        • Moved all doors to UC
        • Will be moving doors off storage nodes onto VM like SRM (FAX and WebDAV)

        WebDAV certificate issues

        • Door is currently on a storage node (uct2-s13.mwt2.org)
        • Needed a subject with SubjectAltName
          • webdav.mwt2.org
          • uct2-s13.mwt2.org
        • Now supported in OSG PKI tools (osg-gridadmin-request -a)
        • CI-Logon support added this Monday (2/1/2016)
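
        A minimal sketch of how the SubjectAltName requirement above could be checked. It works on a dictionary shaped like the one returned by Python's ssl.SSLSocket.getpeercert(); the certificate contents below are a hand-built example, not the actual MWT2 certificate, and the helper names are hypothetical.

```python
def dns_alt_names(cert):
    """Return the DNS SubjectAltName entries of a getpeercert()-style dict."""
    return {
        value.lower()
        for kind, value in cert.get("subjectAltName", ())
        if kind == "DNS"
    }


def missing_alt_names(cert, required):
    """Return the required host names that are absent from the certificate."""
    return sorted({name.lower() for name in required} - dns_alt_names(cert))


# Host names taken from the report; the certificate dict itself is illustrative.
REQUIRED = ["webdav.mwt2.org", "uct2-s13.mwt2.org"]
example_cert = {
    "subjectAltName": (
        ("DNS", "webdav.mwt2.org"),
        ("DNS", "uct2-s13.mwt2.org"),
    )
}
host_only_cert = {"subjectAltName": (("DNS", "uct2-s13.mwt2.org"),)}

print(missing_alt_names(example_cert, REQUIRED))    # []
print(missing_alt_names(host_only_cert, REQUIRED))  # ['webdav.mwt2.org']
```

        A certificate issued only for the storage node name would fail this check for the webdav.mwt2.org alias, which is why both names are needed while the door lives on uct2-s13.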

        minRSS and maxRSS now set

         

      • 11
        NET2
        Speaker: Prof. Saul Youssef (Boston University (US))

        960 new cores were installed last week and are running.

        Working on HTCondor-CE/new OSG version installations at both BU and HU.  HU has a problem with SLURM, in communication with OSG and Nebraska.

        Working on WAN upgrade plan.

        Smooth operations, site has been full.

      • 12
        SWT2-OU
        Speaker: Dr Horst Severini (University of Oklahoma (US))

        - all sites running well, no issues

        - continue working with OSCER admins to bring new OSCER cluster, Schooner, online for ATLAS production; this will be osg-3.3 on CentOS7

         

         

      • 13
        SWT2-UTA
        Speaker: Patrick Mcguigan (University of Texas at Arlington (US))
      • 14
        WT2
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
    • 15
      AOB