US ATLAS Computing Integration and Operations

US/Eastern
Description
Notes and other material available in the US ATLAS Integration Program Twiki
    • 13:00 13:15
      Top of the Meeting 15m
      Speakers: Eric Christian Lancon (BNL), Robert William Gardner Jr (University of Chicago (US))
    • 13:15 13:20
      ADC news and issues 5m
      Speakers: Robert Ball (University of Michigan (US)), Wei Yang (SLAC National Accelerator Laboratory (US))
    • 13:20 13:30
      Production 10m
      Speaker: Mark Sosebee (University of Texas at Arlington (US))
    • 13:30 13:35
      Data Management 5m
      Speaker: Armen Vartapetian (University of Texas at Arlington (US))
    • 13:35 13:40
      Data transfers 5m
      Speaker: Hironori Ito (Brookhaven National Laboratory (US))
    • 13:40 13:45
      Networks 5m
      Speaker: Dr Shawn McKee (University of Michigan ATLAS Group)
    • 13:45 13:50
      FAX and Xrootd Caching 5m
      Speakers: Andrew Bohdan Hanushevsky (SLAC National Accelerator Laboratory (US)), Ilija Vukotic (University of Chicago (US)), Wei Yang (SLAC National Accelerator Laboratory (US))
    • 14:00 14:10
      OS performance testing 10m
      Speaker: Doug Benjamin (Duke University (US))
    • 14:10 14:25
      HPCs integration 15m
      Speaker: Taylor Childers (Argonne National Laboratory (US))
    • 14:25 16:00
      Site Reports
      • 14:25
        BNL 5m
        Speaker: Xin Zhao (Brookhaven National Laboratory (US))
      • 14:30
        AGLT2 5m
        Speakers: Robert Ball (University of Michigan (US)), Dr Shawn McKee (University of Michigan ATLAS Group)

        We have updated all of our gatekeepers to the newest OSG release, 3.3.25.  In conjunction with the update we worked with an OSG team to establish the new lcmaps-voms mapping described at this URL:
        https://twiki.opensciencegrid.org/bin/view/Documentation/Release3/InstallLcmapsVoms

        This has been running at AGLT2 now for a week without any issues that we are aware of.

        The released set of VOMS mappings, in /usr/share/osg/voms-mapfile-default, is not yet complete.  In particular, the ATLAS mappings are not yet updated.  However, two override files and two "ban" files, located in /etc/grid-security, are searched for account mappings before the default file.  Only the first matching entry is used for the account mapping; it is matched against either the first FQAN or the identity portion of the presented certificate.  We have put the following VOMS override file in place at AGLT2:

        [root@gate03 ~]# cat /etc/grid-security/voms-mapfile
        "/atlas/usatlas/Role=production/Capability=NULL" usatlas1
        "/atlas/usatlas/Role=software/Capability=NULL" usatlas2
        "/atlas/usatlas/Role=lcgadmin/Capability=NULL" usatlas2
        "/atlas/Role=lcgadmin/Capability=NULL" usatlas2
        "/atlas/usatlas/Capability=NULL" usatlas3
        "/atlas/Role=production/Capability=NULL" usatlas1
        "/atlas/Capability=NULL" usatlas4
        "/atlas/calib-muon/Role=NULL/Capability=NULL" muoncal
        "/osg/ligo/Role=NULL/Capability=NULL" ligo
        "/fermilab/*" fermilab
        "/cms/*" uscms01

        We also have a /etc/grid-security/grid-mapfile to map the specific DN in use at AGLT2 for the muon calibration effort.

        [root@gate03 ~]# cat /etc/grid-security/grid-mapfile
        "/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=diehl/CN=490810/CN=Edward Diehl" muoncal

        Once AGLT2 has modified dCache to use a similar mechanism, we will no longer be dependent on GUMS and will turn off our service.  OSG in the future will deprecate and then entirely remove their GUMS support.
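
        As an illustration of the "first match wins" lookup described for the voms-mapfile above, here is a minimal Python sketch.  It is not the LCMAPS plugin code: the function names parse_mapfile and map_account are hypothetical, and the use of shell-style glob matching (Python's fnmatch) for entries such as "/cms/*" is an assumption about how the patterns behave.  The sample entries are a subset of the override file listed earlier.

```python
import fnmatch
import shlex

# Subset of the AGLT2 override file shown above; order matters.
VOMS_MAPFILE = '''
"/atlas/usatlas/Role=production/Capability=NULL" usatlas1
"/atlas/usatlas/Role=software/Capability=NULL" usatlas2
"/atlas/Role=production/Capability=NULL" usatlas1
"/fermilab/*" fermilab
"/cms/*" uscms01
'''


def parse_mapfile(text):
    """Return an ordered list of (pattern, account) pairs (hypothetical helper)."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # shlex handles the quoted FQAN pattern followed by the account name
        pattern, account = shlex.split(line)
        entries.append((pattern, account))
    return entries


def map_account(first_fqan, entries):
    """Return the account of the first entry matching the first FQAN, else None.

    Assumption: patterns are treated as case-sensitive shell-style globs.
    """
    for pattern, account in entries:
        if fnmatch.fnmatchcase(first_fqan, pattern):
            return account
    return None


if __name__ == "__main__":
    entries = parse_mapfile(VOMS_MAPFILE)
    print(map_account("/atlas/usatlas/Role=production/Capability=NULL", entries))
    print(map_account("/cms/Role=NULL/Capability=NULL", entries))
```

        Because the lookup stops at the first match, more specific entries need to appear before broader ones in the file.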

        The dCache PostgreSQL databases crashed over the past weekend, causing all jobs to fail.  It appears that auto-vacuum is not working for our pgsql 9.5.7 instance.  This is under investigation, but as of 3pm Monday we were back in business.  As a reminder, we are running dCache 3.8.11.

        AGLT2 will go offline at noon Friday for a complete power outage in the UM server room.  We hope to be back up by day's end on Monday, after multiple maintenance items are completed.  The switcher2 set our queues offline at noon today.

        Singularity is installed on all AGLT2 worker nodes.

      • 14:35
        MWT2 5m
        Speakers: David Lesny (Univ. Illinois at Urbana-Champaign (US)), Lincoln Bryant (University of Chicago (US))

        Site is now full of jobs and operating well.

        Problems over the last three weeks:

        The rooftop condenser fan at UChicago failed two weeks ago.  The building engineers installed a temporary fix while the unit is being replaced.

        • Required some reduction in workers for a day to lower temp in room

        Illinois

        • 10K DDN system lost an entire disk tray
        • Caused loss of any disk redundancy
        • ICC Admins migrated all data to 12K DDN
        • Unfortunately, other bad disks caused the loss of some data, but nothing important
        • Site was down for 4 days to do backup/reformat/restore
        • One positive is the FS is now reformatted with GPFS 4.x
      • 14:40
        NET2 5m
        Speaker: Prof. Saul Youssef (Boston University (US))

         

        Normal operations except for:

        a) The Slurm database is causing occasional problems at NET2/Harvard.  We're working on it.

        b) We're repairing a steady stream of old nodes with failing plastic fans.

        We sped up scanning of GPFS so we can update space tokens every ~6 hours rather than every 24 hours.  This is re: central deletion.

        NESE activities are ramping up: testing a POC Ceph cluster; planning first major purchases; network redesign for the MGHPCC floor is underway.

      • 14:45
        SWT2-OU 5m
        Speaker: Dr Horst Severini (University of Oklahoma (US))
      • 14:50
        SWT2-UTA 5m
        Speaker: Patrick Mcguigan (University of Texas at Arlington (US))

        1) Generally smooth operations over the past three weeks.

        2) Two separate storage issues resulted in GGUS tickets: (i) a faulty hard drive took a storage server down; (ii) a problem with a NIC in a storage server. Both are fixed.

      • 14:55
        WT2 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
    • 16:00 16:05
      AOB 5m