US ATLAS Computing Integration and Operations

Name: US ATLAS Computing Integration and Operations
Start: 2018-04-25T13:00:00-04:00
End: 2018-04-25T15:00:00-04:00

Wednesday 25 Apr 2018, 13:00 → 15:00 US/Eastern

Description

Notes and other material available in the US ATLAS Integration Program Twiki

- 13:00 → 13:05
  
  Top of the Meeting 5m
  
  Speakers: Eric Christian Lancon (BNL), Robert William Gardner Jr (University of Chicago (US))
  
  slides
- 13:05 → 13:10
  ADC news and issues 5m
  
  Speakers: Robert Ball (University of Michigan (US)), Xin Zhao (Brookhaven National Laboratory (US))
  
  Status Update for USATLAS Sites on OSG Migration (1).pdf
  - Squad 5m
    
    Speaker: Mark Sosebee (University of Texas at Arlington (US))
- 13:10 → 13:20
  OSG news 10m
  
  Minutes
  
  Speakers: Brian Lin (University of Wisconsin), Brian Paul Bockelman (University of Nebraska Lincoln (US))
  
  https://opensciencegrid.github.io/technology/policy/service-migrations-spring-2018/
  - Mitigations 5m
    
    Speaker: Xin Zhao (Brookhaven National Laboratory (US))
  - OSG Services Migration 20m
    
    Speaker: Brian Paul Bockelman (University of Nebraska Lincoln (US))
    
    OSG-Service-Migration-Update.pdf
- 13:20 → 13:25
  
  Production 5m
  
  Speaker: Mark Sosebee (University of Texas at Arlington (US))
  
  shift-summary-4_18_18.pdf
  
  shift-summary-4_25_18.pdf
  
  US-cloud-support-4_25_18.pdf
- 13:25 → 13:30
  
  Data Management 5m
  
  Speaker: Armen Vartapetian (University of Texas at Arlington (US))
- 13:30 → 13:35
  
  Data transfers 5m
  
  Speaker: Hironori Ito (Brookhaven National Laboratory (US))
- 13:35 → 13:40
  Networks 5m
  
  Minutes
  
  Speaker: Dr Shawn McKee (University of Michigan ATLAS Group)
  OSG network services are migrating to AGLT2_MSU's VMware instance.
  
  New VMs (psetf, psrsv and psconfig, plus their ITB versions) have already been created.
  
  The plan is to test the new VMs to ensure they are correct, do a final service data migration, and then cut over the DNS IP addresses for *.opensciencegrid.org to put them into production.
  
  OSG ESmond (central MA) and its associated VMs (7 of them) will be shut down.
  
  The HEPiX Network Function Virtualization working group met today: https://indico.cern.ch/event/715631/
  
  A recording will be posted soon for those who missed it.
- 13:40 → 13:45
  
  XCache 5m
  
  Minutes
  
  Speakers: Andrew Bohdan Hanushevsky (SLAC National Accelerator Laboratory (US)), Ilija Vukotic (University of Chicago (US)), Wei Yang (SLAC National Accelerator Laboratory (US))
  
  Technical details discussed at the regular XCache Monday meeting.
  
  Wei and Xin received copies of the XCache robot cert. Wei and Shawn will try Lincoln's instructions to set up their own k8s clusters as soon as they get hardware in place. Ilija will test how things get deployed at the Utah cluster. Helm deployment development will start with the multinode XCache cluster (so not yet). We will have to discuss with Andy how to organize the cluster and its storage.
  
  Presented the effects of pCache at the TCB meeting. An ARC site claims a ~80% cache hit rate from a 250 TB LRU cache!
  
  Developing code to simulate different caching configurations.
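
As a rough illustration of the kind of caching simulation mentioned above, here is a minimal sketch of replaying a file-access trace through a size-bounded LRU cache and measuring the hit rate. It is not the code under development: the function name `simulate_lru`, the `(name, size)` trace format and the toy trace are assumptions made purely for illustration.

```python
from collections import OrderedDict


def simulate_lru(accesses, capacity_bytes):
    """Replay (name, size) accesses through an LRU cache and return the hit rate."""
    cache = OrderedDict()  # file name -> size, least recently used first
    used = 0               # bytes currently held in the cache
    hits = 0
    total = 0
    for name, size in accesses:
        total += 1
        if name in cache:
            hits += 1
            cache.move_to_end(name)  # mark as most recently used
            continue
        if size > capacity_bytes:
            continue  # file can never fit, so it is not cached
        # Evict least recently used files until the new file fits.
        while used + size > capacity_bytes:
            _, evicted_size = cache.popitem(last=False)
            used -= evicted_size
        cache[name] = size
        used += size
    return hits / total if total else 0.0


# Hypothetical toy trace: three files of 4 bytes each.
trace = [("a", 4), ("b", 4), ("a", 4), ("c", 4), ("b", 4), ("a", 4)]
for capacity in (8, 12):
    print(capacity, simulate_lru(trace, capacity))
```

Running the same trace against several capacities, as in the loop above, is one simple way to compare cache configurations; a realistic study would replay recorded access logs rather than a toy trace.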
- 13:45 → 13:50
  
  HPCs integration 5m
  
  Minutes
  
  Speaker: Taylor Childers (Argonne National Laboratory (US))
  
  NERSC using Harvester + Minipilot. Working smoothly. Allocation completely exhausted this past week. Containers being used for software distribution.
  
  ALCF using Harvester + Yoda. Lots of debug work ongoing related to PanDA settings, JumboJobs, Athena performance improvements, etc. Aiming for mid-May to have Yoda tested, validated and ready for production jobs. Singularity containers now being used for software distribution.
  
  OLCF testing Harvester + Minipilot, but not yet in production. Aiming for mid-May to have Harvester online. No ETA for containers in production, but no obvious hang-ups to deploying them at this point.
- 13:50 → 14:25
  Site Reports
  - 13:50
    BNL 5m
    
    Minutes
    
    Speaker: Xin Zhao (Brookhaven National Laboratory (US))
    
    Increase of LOCALGROUPDISK space to 1 PB, to help with decommissioning SEs at small sites.
    
    With the new TOR switches in place, migration of the rest of the farm to SL7 started this week in a rolling fashion; some old WNs will be removed as well. The whole process is expected to finish by late May.
    
    OSG migration
    
    working with ITD on joining InCommon CA
    
    working with GGUS on interfacing GGUS with BNL RT directly.
  - 13:55
    
    AGLT2 5m
    
    Minutes
    
    Speakers: Robert Ball (University of Michigan (US)), Dr Shawn McKee (University of Michigan ATLAS Group)
    
    All UM C6420 sleds are now powered up. All but 4 are in HTCondor production; the last 4 are ready now that power on the PDUs has been balanced out. All give consistent HS06 results.
    
    MSU sleds still awaiting switch reconfiguration.
    
    All operations running normally.
  - 14:00
    MWT2 5m
    
    Minutes
    
    Speakers: David Lesny (Univ. Illinois at Urbana-Champaign (US)), Judith Lorraine Stephen (University of Chicago (US)), Lincoln Bryant (University of Chicago (US))
    
    Overall the site is performing well and is full of jobs
    
    UC
    
    Elasticsearch: new ES hardware now online
    
    IU
    
    C6420 deployment: working on getting the first C6420 built
    
    UIUC
    
    no new updates
  - 14:05
    
    NET2 5m
    
    Minutes
    
    Speaker: Prof. Saul Youssef (Boston University (US))
    
    Our OSG 3.4 / LCMAPS / HTCondor upgrade is done.
    
    Our HTCondor problems have not appeared at Harvard or BU since the upgrade... so far. Fingers crossed.
    
    We're ready to migrate away from GRAM. Coordinating with Jose and John Hover.
    
    Installed Wei's version of GridFTP with a callout to our Adler32 code. Works fine.
    
    Next up is migrating away from BeStMan.
    
    LHCONE peering resumed successfully after replacing a bad card at MANLAN.
    
    NET3 "Northeast Tier 3" slowly growing. UMASS/Amherst buy-in ordered.
    
    We strangely had a very high rate of deletion for 2-3 days, causing SRM stress and a couple of trouble tickets. We made some adjustments, and then the deletion rate mysteriously went down by a factor of 5 or so.
    
    There is lots of NESE activity. 10 PB raw deployment ordered.
    
    SL7 migration coming soon.
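
For readers unfamiliar with the Adler32 checksum mentioned in the GridFTP item above, here is a minimal sketch of computing it for a file in fixed-size chunks with Python's standard `zlib.adler32`. This is not NET2's callout code: the function name `adler32_hex`, the chunk size and the zero-padded 8-digit hex output format are assumptions for illustration only.

```python
import zlib


def adler32_hex(path, chunk_size=1024 * 1024):
    """Return the Adler-32 checksum of a file as 8 lowercase hex digits."""
    value = 1  # Adler-32 starts from 1, not 0
    with open(path, "rb") as f:
        # Read in chunks so large files are never held in memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            value = zlib.adler32(chunk, value)
    # Mask to 32 bits and zero-pad so short checksums keep their leading zeros.
    return f"{value & 0xFFFFFFFF:08x}"
```

Because the running value is passed back into `zlib.adler32` for each chunk, the result does not depend on the chunk size chosen.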
  - 14:10
    
    SWT2-OU 5m
    
    Minutes
    
    Speaker: Dr Horst Severini (University of Oklahoma (US))
    
    - Currently in OSCER scheduled maintenance till this evening
    
    - Lucille cooling failure, running with reduced capacity
    
    - Had OU network problem last week, fixed
    
    - Experienced 50% DDM transfer failures earlier this week, which were tracked down to checksum timeouts. Extended that timeout in our GridFTP server, which fixed the failures. Wei knows the details.
  - 14:15
    SWT2-UTA 5m
    
    Minutes
    
    Speaker: Patrick Mcguigan (University of Texas at Arlington (US))
    
    UTA_SWT2
    
    Will try to update production CE to HTCondor
    
    SWT2_CPB
    
    Still scaling HTCondor. Identified an issue with SAM tests.
    
    The XRootD issue with long pathnames still exists and has not been reproduced by Wei; will investigate further.
  - 14:20
    
    WT2 5m
    
    Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
- 14:25 → 14:30
  
  AOB 5m
- 14:30 → 14:40
  
  Singularity / centos 7 deployment in the US cloud 10m
  
  Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
