US ATLAS Computing Facility

US/Eastern
    • 13:00 13:10
      WBS 2.3 Facility Management News 10m
      Speakers: Eric Christian Lancon (BNL), Robert William Gardner Jr (University of Chicago (US))
    • 13:10 13:20
      OSG-LHC 10m
      Speakers: Brian Lin (University of Wisconsin), Matyas Selmeci

      3.4.27

    • 13:20 13:40
      Topical Report
      • 13:20
        HPC Operations update 5m
        Speaker: Doug Benjamin (Duke University (US))
    • 13:40 14:25
      US Cloud Status
      • 13:40
        US Cloud Operations Summary 5m
        Speaker: Mark Sosebee (University of Texas at Arlington (US))
      • 13:45
        BNL 5m
        Speaker: Xin Zhao (Brookhaven National Laboratory (US))
        • SL7 migration in progress; will be done in April as planned
        • T1 farm has been draining somewhat over the past several days because not enough pilots are arriving; under investigation with the Harvester team
        • FTS upgraded
      • 13:50
        AGLT2 5m
        Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn McKee (University of Michigan ATLAS Group), Prof. Wenjing Wu (Computer Center, IHEP, CAS)

        Services:

        Had two tickets for failed jobs caused by insufficient space in the job work directories. Both BOINC and Condor leave zombie files (deleted but still held open by processes) when jobs get killed; we deployed scripts to scan for and remove these files. For Condor, this happens because the new ATLAS wrapper creates the work directory outside of the Condor home directory, so when Condor kills a job (e.g., one overusing memory), it fails to remove that job's work directory.
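        The scan for deleted-but-still-open files mentioned above could look roughly like the sketch below. This is not the script deployed at AGLT2; it is a minimal illustration that assumes a Linux worker node, where the kernel reports such files as "<path> (deleted)" in the /proc/<pid>/fd symlinks. The function name and the path_prefix parameter are hypothetical. The sketch only reports the files; deciding how to release the space (for example, by stopping the process that holds them) is left out.

```python
import os

DELETED_SUFFIX = " (deleted)"


def find_deleted_open_files(proc_root="/proc", path_prefix="/"):
    """Return sorted (pid, fd, path) tuples for files that were unlinked
    but are still held open by a process.

    proc_root and path_prefix are illustrative parameters: the first lets
    the scan be pointed at a test directory, the second restricts the
    report to one file system area (e.g. the job scratch space).
    """
    hits = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():  # only numeric entries are processes
            continue
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:  # process exited, or we lack permission
            continue
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue
            if target.endswith(DELETED_SUFFIX) and target.startswith(path_prefix):
                path = target[: -len(DELETED_SUFFIX)]
                hits.append((int(entry), int(fd), path))
    return sorted(hits)
```

        Running it as root with path_prefix set to the scratch area lists the processes that still pin disk space after their job directory has been removed.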

        Overall, the Condor utilization rate has improved significantly: on average, 99% of the cores are claimed by ATLAS jobs, and with the BOINC backfill jobs the CPU utilization of the cluster reaches 93%.

      • 13:55
        MWT2 5m
        Speakers: Judith Lorraine Stephen (University of Chicago (US)), Lincoln Bryant (University of Chicago (US))

        Added UCORE configuration on our gatekeepers. Noticed an issue specifically with UCORE jobs on Condor workers with cgroups enabled. Cgroups are currently disabled at IU and UIUC (they were already disabled at UC).

        Networking issues over the weekend at UIUC took our Illinois infrastructure offline. This was resolved by ICC admins on Sunday.

        Hypervisor issues at UC took the entire site offline for half a day. This has been resolved now.

        The new UIUC nodes have arrived. We're working on adding them to the MWT2 configuration and benchmarking them. The new IU nodes should be online this week.

      • 14:00
        NET2 5m
        Speaker: Prof. Saul Youssef (Boston University (US))

        We had a problem with Gratia reporting that kept most of March's records from reaching OSG and WLCG. We don't understand why, but it was fixed by an update of the OSG software on our CE. Gratia used to pick up old records automatically, but it is not doing so this time. We could use a bit of help on this.

        Our current highest priorities are:  OSG update -> Fix rucio globus problem -> switch from LSM to rucio-mover -> SL7 upgrade -> Singularity

        NESE work is in high gear. RH and Harvard are testing the initial deployment. Networking from NET2 to NESE is in place. Making an ATLAS DDM endpoint will be next, done in parallel with a Globus endpoint and various POSIX testing.

        We've noticed an odd problem where a batch of phys-higgs tgz files have the correct Adler checksums but are not valid zip files. This probably has nothing to do with us, but we will gather more information and follow up.
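        A check along these lines can help gather that information: compute the Adler-32 checksum of each file and, separately, test whether the file can actually be read as an archive. The sketch below is only an illustration under stated assumptions, not the procedure used at NET2. It assumes the files are meant to be gzip-compressed tar archives (as the tgz extension suggests) and that the catalog records the checksum as eight lowercase hex digits; both function names are hypothetical.

```python
import tarfile
import zlib


def adler32_hex(path, chunk_size=1 << 20):
    """Adler-32 of a file, read in chunks, as 8 lowercase hex digits."""
    value = 1  # Adler-32 starting value
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            value = zlib.adler32(chunk, value)
    return f"{value & 0xFFFFFFFF:08x}"


def is_readable_tgz(path):
    """True if the file opens as a gzip-compressed tar archive and every
    member header can be walked without an error."""
    try:
        with tarfile.open(path, "r:gz") as tar:
            for _ in tar:  # step through all member headers
                pass
        return True
    except (tarfile.TarError, OSError, EOFError, zlib.error):
        return False
```

        A file for which adler32_hex matches the catalog value while is_readable_tgz returns False was most likely already broken when its checksum was registered, which would point upstream of the storage site.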

        Like BNL, we're seeing Harvester fail to keep our site full. We will get in touch with Xin, as it is fairly likely to be a similar problem.

      • 14:05
        SWT2 5m
        Speakers: Dr Horst Severini (University of Oklahoma (US)), Kaushik De (University of Texas at Arlington (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))

        OU:

        - OSCER site running well

        - OSCER_OPP pilots are having certificate issues, investigating

        - LUCILLE seeing IPv6 issues, investigating; makes no sense, since nothing has changed there

        UTA:

        Still investigating an issue with HTCondor-CE and delegated proxy renewals. BNL's change keeps the pilot from seeing whatever local issue is occurring. This is not the case with UCORE/Harvester.

      • 14:10
        HPC Operations 5m
        Speaker: Doug Benjamin (Duke University (US))
      • 14:15
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:20
        Analysis Facilities - BNL 5m
        Speaker: William Strecker-Kellogg (Brookhaven National Lab)

        Nothing to report.

        LOL!

    • 14:25 14:30
      AOB 5m